Memory Error when training Multinomial Naive Bayes model - python

The training portion of my code can handle data on the order of 10^4 comments, but since my whole dataset consists of ~500,000 comments I would like to train it with much more data. I seem to run out of memory when running the trainer with 100,000 reviews.
My get_features function seems to be the culprit.
data = get_data(limit=size)
data = clean_data(data)
all_words = [w.lower() for (comment, category) in data for w in comment]

word_features = []
for i in nltk.FreqDist(all_words).most_common(3000):
    word_features.append(i[0])

random.shuffle(data)

def get_features(comment):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))  # error here
    return features
# I can do it myself like this:
feature_set = [(get_features(comment), category) for
               (comment, category) in data]

# Or use nltk's Lazy Map implementation, which arguably does the same thing:
# feature_set = nltk.classify.apply_features(get_features, data, labeled=True)
Running this for 100,000 reviews eats up all of my 32 GB of RAM and eventually crashes with a MemoryError at the features[word] = (word in set(comment)) line.
What can I do to alleviate this problem?
EDIT: I have significantly reduced the number of features: I now use only the 3000 most common words as features. This has significantly improved performance (for obvious reasons). I also corrected a small mistake pointed out by @Marat.

Disclaimer: this code has many potential flaws, so I expect it will take a few iterations to get to the root cause.
Parameter mismatch:
# defined with one parameter
def get_features(comment):
    ...

# called with two
... get_features(comment, word_features), ...
Suboptimal word lookup:
# set(comment) executed on every iteration
for word in word_features:
    features[word] = (word in set(comment))

# can be transformed into something like:
word_set = set(comment)
for word in word_features:
    features[word] = word in word_set

# if typical comment length is < 30, list lookup is faster
for word in word_features:
    features[word] = word in comment
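As a rough illustration of that trade-off (my own hypothetical snippet, not from the answer; timings are machine-dependent), the two lookups can be compared with timeit:

# Illustrative only: compare a list scan vs. a set lookup for a short comment.
import timeit

comment = "this movie was surprisingly good and fun".split()
word_set = set(comment)

print(timeit.timeit(lambda: 'good' in comment, number=100_000))   # linear scan of a short list
print(timeit.timeit(lambda: 'good' in word_set, number=100_000))  # hash lookup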
Suboptimal feature computing:
# it is cheaper to set a few positives than to check all word_features
# also MUCH more memory efficient
from collections import defaultdict
...

def get_features(comment):
    features = defaultdict(bool)
    for word in comment:
        features[word] = True
    return features
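To make the memory saving concrete, here is a tiny hypothetical example (my own illustration, not part of the answer): only the words actually present in the comment end up stored, while anything else defaults to False:

feats = get_features(['great', 'fun', 'movie'])
print(len(feats))         # 3, not 3000: only words from the comment are stored
print(feats['terrible'])  # False via the defaultdict default (note: this lookup inserts the key)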
Suboptimal feature storage:
# numpy array is much more efficient than a list of dicts
# .. and with pandas on top it's even nicer:
import pandas as pd
...

feature_set = pd.DataFrame(
    ({word: True for word in comment}
     for (comment, _) in data),
    columns=word_features
).fillna(False)
feature_set['category'] = [category for (_, category) in data]
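For completeness, here is a minimal sketch of where such a dense feature table could go next; it is not part of the original answer and assumes scikit-learn is installed and that feature_set was built exactly as above:

# Hedged sketch: fit scikit-learn's MultinomialNB on the pandas feature table from above.
from sklearn.naive_bayes import MultinomialNB

X = feature_set.drop(columns=['category']).astype(int).values  # booleans -> 0/1 counts
y = feature_set['category'].values

clf = MultinomialNB()
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy, just a sanity check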

Related

Solving memory issues when using Gensim LDA Multicore

For my project I am trying to use unsupervised learning to identify different topics from application descriptions, but I am running into a strange problem. Firstly, I have 3 different datasets: one with 15k documents, another with 50k documents, and the last with 2m documents. I am testing models with different numbers of topics (k), ranging from 5 to 100 with a step size of 5, in order to check which k results in the best model, assessed initially by the highest coherence score. For each k, I also build 3 different models with chunk sizes of 10, 100 and 1000.
So now, moving on to the problem I am having. Obviously my own machine is too slow and does not have enough cores for this kind of computation, hence I am using my university's server. The problem is that my program seems to be consuming too much memory and I am unsure of the reason. I already made some adjustments so that the corpus is not loaded entirely into memory (or at least I think I did). The dataset with 50k entries has already consumed the allotted 100 GB of memory at iteration k=50 (so halfway), which seems very large.
I would appreciate any help in the right direction, and thanks for taking the time to look at this. Below is the code from my topic_modelling.py file. The comments in the file are a bit outdated, sorry about that.
class MyCorpus:
    texts: list
    dictionary: dict

    def __init__(self, descriptions, dictionary):
        self.texts = descriptions
        self.dictionary = dictionary

    def __iter__(self):
        for line in self.texts:
            try:
                # assume there's one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line)
            except StopIteration:
                pass

# Function that, given a dataframe, creates a dictionary and corpus.
# These are used to create an LDA model. Here we automatically use the Description column
# from each dataframe.
def create_dict_and_corpus(df):
    text_descriptions = remove_characters_and_create_list(df, 'Description')
    # print(text_descriptions)
    dictionary = gensim.corpora.Dictionary(text_descriptions)
    corpus = MyCorpus(text_descriptions, dictionary)
    return text_descriptions, dictionary, corpus

# Given a dataframe and a column name in the dataframe, extract all words and return a list.
# Also removes all characters that are not alphanumeric or spaces.
def remove_characters_and_create_list(df, column_name, split=True):
    df[column_name] = df[column_name].astype(str)
    texts = []
    for x in range(df[column_name].size):
        current_string = df[column_name][x]
        filtered_string = re.sub(r'[^A-Za-z0-9 ]+', '', current_string)
        if split:
            texts.append(filtered_string.split())
        else:
            texts.append(filtered_string)
    return texts

# This function, given the parameters, creates an LDA model for each number between
# the start limit and the end limit. After this the coherence and perplexity are calculated
# for each of those models and saved in a csv file to analyze later.
def test_lda_models(text, corpus, dictionary, start_limit, end_limit, path):
    results = []
    print("============Starting topic modelling============")
    for k in range(start_limit, end_limit + 1, 5):
        for p in range(1, 4):
            chunk = pow(10, p)
            t0 = time.time()
            lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                                num_topics=k,
                                                                id2word=dictionary,
                                                                passes=p,
                                                                chunksize=chunk)
            # To calculate the goodness of the model
            perplexity = lda_model.bound(corpus)
            coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=dictionary, coherence='c_v')
            coherence_lda = coherence_model_lda.get_coherence()
            t1 = time.time()
            print(f"=====Done K={k} model with passes={p} and chunksize={chunk}, took {t1-t0} seconds=====")
            results.append((k, chunk, coherence_lda, perplexity))
    # Storing the results in a csv file, except the actual lda model (this would not make sense)
    path = make_dir_if_not_exists(path)
    list_tuples_to_csv(results, ['#OfTopics', 'ChunkSize', 'CoherenceScore', 'Perplexity'], f"{path}/K={start_limit}to{end_limit}.csv")
    return results

# Function to plot the visualization of an LDA model. This visualization is then
# saved as an html file inside the given path.
def single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path):
    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(vis, f"{path}/visualization.html")

# Given the results produced by test_lda_models, loop through the models and save the
# topic words of each model and the visualization of the topics in the given path.
def save_lda_result(k, c, lda_model, corpus, dictionary, path):
    list_tuples_to_csv(lda_model.print_topics(num_topics=k), ['Topic#', 'Associated Words'], f"{path}/associated_words.csv")
    single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path)

# This is the entire pipeline that needs to be performed for a single dataset,
# which includes computing the LDA models from start to end limit and calculating
# and saving the topic words and visual graphs for the top n topics with the highest
# coherence score.
def perform_topic_modelling_single_df(df, start_limit, end_limit, path):
    # Extracting the necessary data required for LDA model computation
    text_descriptions, dictionary, corpus = create_dict_and_corpus(df)
    results_lda = test_lda_models(text_descriptions, corpus, dictionary, start_limit, end_limit, path)
    # Sorting the results based on the coherence value (index 2 of each tuple)
    results_lda.sort(key=lambda x: x[2], reverse=True)
    # Getting the top 5 results to pass to the save_lda_result function
    results = results_lda[:5]
    corpus_for_saving = [dictionary.doc2bow(text) for text in text_descriptions]
    texts = remove_characters_and_create_list(df, 'Description', split=False)
    # Perform application-to-topic modelling for the best lda model based on the
    # coherence score (TODO maybe test with other lda models?)
    print("getting descriptions for csv")
    for k, c, _, _ in results:
        dir_path = make_dir_if_not_exists(f"{path}/k={k}_chunk={c}")
        p = int(math.log10(c))
        lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                            num_topics=k,
                                                            id2word=dictionary,
                                                            passes=p,
                                                            chunksize=c)
        print(f"=====REDOING K={k} model with passes={p} and chunksize={c}=====")
        save_lda_result(k, c, lda_model, corpus_for_saving, dictionary, dir_path)
        application_to_topic_modelling(df, k, c, lda_model, corpus_for_saving, texts, dir_path)

# Performs the whole topic modelling pipeline, taking different genre data sets
# and the entire dataset as a whole.
def perform_topic_modelling_pipeline(path_ex):
    # entire_df = pd.read_csv("../data/preprocessed_data/preprocessed_10000_trial.csv")
    entire_df = pd.read_csv(os.path.join(ROOT_DIR, f"data/preprocessed_data/preprocessedData_{path_ex}.csv"))
    print("size of df")
    print(entire_df.shape)
    # For the entire df, go from start limit to nGenres to find the best LDA model
    nGenres = row_counter(os.path.join(ROOT_DIR, f"data/genre_wise_data/data{path_ex}/genre_frequency.csv"))
    nGenres_rounded = math.ceil(nGenres / 5) * 5
    print(f"Original number of genres should be {nGenres}, but we are rounding to {nGenres_rounded}")
    path = make_dir_if_not_exists(os.path.join(ROOT_DIR, f"results/data{path_ex}/aall_data"))
    perform_topic_modelling_single_df(entire_df, 5, 100, path)
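For readers comparing approaches: a corpus wrapper can stream straight from disk so that only the dictionary stays in RAM. The sketch below is my own illustration, not part of the question's code; it assumes a hypothetical descriptions.txt with one cleaned, whitespace-tokenized description per line.

# Hypothetical fully streaming corpus: documents are re-read from disk on every pass.
import gensim

class StreamingCorpus:
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding='utf-8') as fh:
            for line in fh:
                yield self.dictionary.doc2bow(line.split())

dictionary = gensim.corpora.Dictionary(
    line.split() for line in open('descriptions.txt', encoding='utf-8'))
corpus = StreamingCorpus('descriptions.txt', dictionary)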

joblib results vary wildly depending on return value

I have to analyse a large text dataset using spaCy. The dataset contains about 120,000 records with a typical text length of about 1000 words. Lemmatizing the text takes quite some time, so I looked for methods to reduce it. This article describes how to speed up the computation using joblib. That works reasonably well: 16 cores reduce the CPU time by a factor of 10, and the hyperthreads shave off an extra 7%.
Recently I realized that I wanted to compute similarities between docs, and probably run more analyses on docs later on. So I decided to generate a spaCy document instance (<class 'spacy.tokens.doc.Doc'>) for all documents and use that for later analyses (lemmatizing, vectorizing, and probably more). This is where the trouble started.
The analyses of the parallel lemmatizer take place in the function below:
def lemmatize_pipe(doc):
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha]
    return lemma_list
(The full demo code can be found at the end of the post.) All I have to do is return doc instead of lemma_list and I'm done. Or so I thought.
def lemmatize_pipe(doc):
    return doc
The sequential version runs in 73 seconds, the parallel version returning lemma_list takes 7 seconds while the version returning doc runs in 127 seconds: twice as much as the sequential version. Full code below.
import time
import pandas as pd
from joblib import Parallel, delayed
import gensim.downloader as api
import spacy
from pdb import set_trace as breakpoint

# Initialize spacy with the small english language model
nlp = spacy.load('en', disable=['parser', 'ner', 'tagger'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))

# Import the dataset and get the text
dataset = api.load("text8")
data = [d for d in dataset]
doc_requested = False
print(len(data), 'documents in original data')
df_data = pd.DataFrame(columns=['content'])
df_data['content'] = df_data['content'].astype(str)

# Content is a list of words, convert it to strings
for doc in data:
    sentence = ' '.join([word for word in doc])
    df_data.loc[len(df_data)] = [sentence]

### === Sequential processing ===
def lemmatize(text):
    doc = nlp(text)
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha]
    return doc if doc_requested else lemma_list

cpu = time.time()
df_data['sequential'] = df_data['content'].apply(lemmatize)
print('\nSequential processing in {:.0f} seconds'.format(time.time() - cpu))
df_data.head(3)

### === Parallel processing ===
def lemmatize_pipe(doc):
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha]
    return doc if doc_requested else lemma_list

def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length,
                                                            chunksize))

def process_chunk(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    return preproc_pipe

def preprocess_parallel(data, chunksize):
    executor = Parallel(n_jobs=31, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(data, len(data), chunksize=chunksize))
    result = executor(tasks)
    flattened = [item for sublist in result for item in sublist]
    return flattened

cpu = time.time()
df_data['parallel'] = preprocess_parallel(df_data['content'], chunksize=1)
print('\nParallel processing in {:.0f} seconds'.format(time.time() - cpu))
I have searched and tried all kinds of things but could not find a solution. In the end I have decided to compute the similarities together with the lemmas, but that is a workaround. What actually is the cause of the time increase? And is there a way to get the docs without losing that much time?
A pickled doc is quite large and contains a lot of data that isn't needed to reconstruct the doc itself, including the entire model vocab. Using doc.to_bytes() will be a major improvement, and you can improve it a bit more by using exclude to exclude data that you don't need, like doc.tensor:
data = doc.to_bytes(exclude=["tensor"])
...
reloaded_doc = Doc(nlp.vocab)
reloaded_doc.from_bytes(data)
To compare:
doc = nlp("test")
len(pickle.dumps(doc)) # 1749721
len(pickle.dumps(doc.to_bytes())) # 750
len(pickle.dumps(doc.to_bytes(exclude=["tensor"]))) # 316
You can also use doc.to_array() instead of doc.to_bytes() to export only the annotation layers that you need, but reloading the doc from the array is slightly more complicated.
See:
https://spacy.io/usage/saving-loading#docs
https://spacy.io/api/doc#serialization-fields
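For the doc.to_array() route mentioned above, the sketch below loosely follows the pattern in the spaCy docs linked above; the model name en_core_web_sm and the chosen attributes are my own assumptions, not from the original answer.

# Sketch: export only selected annotation layers and rebuild a Doc from them.
import spacy
from spacy.attrs import LEMMA, IS_ALPHA
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')
doc = nlp("A small example sentence.")

arr = doc.to_array([LEMMA, IS_ALPHA])    # one row per token, integer-encoded
words = [t.text for t in doc]            # the raw words must be kept separately

doc2 = Doc(nlp.vocab, words=words)
doc2.from_array([LEMMA, IS_ALPHA], arr)  # lemmas are restored, other layers are dropped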

6 GB RAM Fails in Vectorizing text using Word2Vec

I'm trying to do a basic tweet sentiment analysis using word2vec and tf-idf scores on a dataset consisting of 1.6M tweets, but my 6 GB GeForce Nvidia card fails to handle it. Since this is my first practice machine learning project, I'm wondering what I'm doing wrong: the dataset is all text, so it shouldn't take this much RAM, yet my laptop freezes in the tweet2vec function or gives a Memory Error in the scaling part. Below is the part of my code where everything collapses.
The last thing is that I've tried with up to 1M data points and it worked! So I'm curious what causes the problem.
# --------------- calculating word weight for using later in word2vec model & bringing words together ---------------
def word_weight(data):
    vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
    d = dict()
    for index in tqdm(data, total=len(data), desc='Assigning weight to words'):
        # --------- try/except catches the empty indexes ----------
        try:
            matrix = vectorizer.fit_transform([w for w in index])
            tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
            d.update(tfidf)
        except ValueError:
            continue
    print("every word has weight now\n"
          "--------------------------------------")
    return d

# ------------------- bringing tokens with weight to recreate tweets ----------------
def tweet2vec(tokens, size, tfidf):
    count = 0
    for index in tqdm(tokens, total=len(tokens), desc='creating sentence vectors'):
        # ---------- size is the dimension of word2vec model (200) ---------------
        vec = np.zeros(size)
        for word in index:
            try:
                vec += model[word] * tfidf[word]
            except KeyError:
                continue
        tokens[count] = vec.tolist()
        count += 1
    print("tweet vectors are ready for scaling for ML algorithm\n"
          "-------------------------------------------------")
    return tokens

dataset = read_dataset('training.csv', ['target', 't_id', 'created_at', 'query', 'user', 'text'])
dataset = delete_unwanted_col(dataset, ['t_id', 'created_at', 'query', 'user'])
dataset_token = [pre_process(t) for t in tqdm(map(lambda t: t, dataset['text']),
                                              desc='cleaning text', total=len(dataset['text']))]
print('pre_process completed, list of tweet tokens is returned\n'
      '--------------------------------------------------------')
X = np.array(tweet2vec(dataset_token, 200, word_weight(dataset_token)))
print('scaling vectors ...')
X_scaled = scale(X)
print('features scaled!')
print('features scaled!')
The data given to the word_weight function is a (1599999, 200) shaped list in which each index consists of pre-processed tweet tokens.
I appreciate your time and answer in advance, and of course I'm glad to hear about better approaches for handling big datasets.
If I understood correctly, it works with 1M tweets, but fails with 1.6M tweets? So you know the code is correct.
If the GPU is running out of memory when you think it shouldn't, it may be holding on to memory from a previous process. Use nvidia-smi to check which processes are using the GPU, and how much memory. If (before you run your code) you spot Python processes in there holding a big chunk, it could be a crashed process, a Jupyter window still open, etc.
I find it useful to watch nvidia-smi (not sure if there is a Windows equivalent) to see how GPU memory changes as training progresses. Normally a chunk is reserved at the start, and then it stays fairly constant. If you see it rising linearly, something could be wrong with the code (are you re-loading the model on each iteration, something like that?).
My problem was solved when I changed the code (the tweet2vec function) to this (w is the word weight):
def tweet2vec(tokens, size, tfidf):
    # ------------- size is the dimension of word2vec model (200) ---------------
    vec = np.zeros(size).reshape(1, size)
    count = 0
    for word in tokens:
        try:
            vec += model[word] * tfidf[word]
            count += 1
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

X = np.concatenate([tweet2vec(token, 200, w) for token in tqdm(map(lambda token: token, dataset_token),
                                                               desc='creating tweet vectors',
                                                               total=len(dataset_token))]
                   )
I have no idea why!!!!

Python nltk classify with large feature set (Replicate Go Et Al 2009)

I'm trying to replicate Go et al.'s Twitter sentiment analysis, which can be found here: http://help.sentiment140.com/for-students
The problem I'm having is that the number of features is 364,464. I'm currently using nltk and nltk.NaiveBayesClassifier to do this, where tweets holds a replication of the 1,600,000 tweets and their polarity:
for tweet in tweets:
    tweet[0] = extract_features(tweet[0], features)

classifier = nltk.NaiveBayesClassifier.train(training_set)
# print "NB Classified"
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))
Nothing takes very long apart from the extract_features function:
def extract_features(tweet, featureList):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
This is because for each tweet it's creating a dictionary of size 364,464 to represent whether something is present or not.
Is there a way to make this faster or more efficient without reducing the number of features like in this paper?
Turns out there is a wonderful function called:
nltk.classify.util.apply_features()
which you can find here: http://www.nltk.org/api/nltk.classify.html
training_set = nltk.classify.apply_features(extract_features, tweets)
I had to change my extract_features function but it now works with the huge sizes without memory issues.
Here's a lowdown of the function description:
The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).
and my changed function:
def extract_features(tweet):
    tweet_words = set(tweet)
    global featureList
    features = {}
    for word in featureList:
        features[word] = False
    for word in tweet_words:
        if word in featureList:
            features[word] = True
    return features
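To tie the pieces together, here is a tiny self-contained sketch (hypothetical toy data, my own illustration) of the lazy training path with the single-argument extract_features above:

# Toy end-to-end example of apply_features feeding the Naive Bayes trainer.
import nltk

featureList = ['good', 'bad', 'great']                           # stand-in for the real 364,464 features
tweets = [(['good', 'movie'], 'pos'), (['bad', 'film'], 'neg')]

training_set = nltk.classify.apply_features(extract_features, tweets)  # lazy, no giant list in RAM
classifier = nltk.NaiveBayesClassifier.train(training_set)
print(classifier.classify(extract_features(['great', 'movie'])))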

Hierarchical Dirichlet Process Gensim topic number independent of corpus size

I am using the Gensim HDP module on a set of documents.
>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17
Why is the number of topics independent of corpus length?
@Aaron's code above is broken due to gensim API changes. I rewrote and simplified it as follows. It works as of June 2017 with gensim v2.1.0:
import pandas as pd

def topic_prob_extractor(gensim_hdp):
    shown_topics = gensim_hdp.show_topics(num_topics=-1, formatted=False)
    topics_nos = [x[0] for x in shown_topics]
    weights = [sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos]
    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
@Aaron's and @Roko Mijic's approaches neglect the fact that show_topics returns, by default, only the top 20 words of each topic. If one returns all the words that compose a topic, all the approximated topic probabilities in that case will be 1 (or 0.999999). I experimented with the following code, which is an adaptation of @Roko Mijic's:
def topic_prob_extractor(gensim_hdp, t=-1, w=25, isSorted=True):
    """
    Input the gensim model to get the rough topics' probabilities
    """
    shown_topics = gensim_hdp.show_topics(num_topics=t, num_words=w, formatted=False)
    topics_nos = [x[0] for x in shown_topics]
    weights = [sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos]
    if isSorted:
        return pd.DataFrame({'topic_id': topics_nos, 'weight': weights}).sort_values(by="weight", ascending=False)
    else:
        return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
A better, yet I'm not sure if 100% valid, approach is the one mentioned here. You can get the topics' true weights (alpha vector) of the HDP model as:
alpha = hdpModel.hdp_to_lda()[0];
Examining the topics' equivalent alpha values is more logical than tallying up the weights of the first 20 words of each topic to approximate its probability of usage in the data.
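As a small illustration of that idea (my own sketch, assuming hdpModel is the trained gensim HdpModel from the snippet above):

# Rank topics by their alpha weight instead of by summed word weights.
import numpy as np

alpha = hdpModel.hdp_to_lda()[0]
ranked = np.argsort(alpha)[::-1]                   # topic ids, most heavily used first
print(list(zip(ranked[:10], alpha[ranked[:10]])))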
There is apparently a bug in gensim (version 3.8.3) in which giving -1 to show_topics doesn't return anything at all. So I have tweaked the answers by Roko Mijic and Aaron:
def topic_prob_extractor(gensim_hdp):
    shown_topics = gensim_hdp.show_topics(num_topics=gensim_hdp.m_T, formatted=False)
    topics_nos = [x[0] for x in shown_topics]
    weights = [sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos]
    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
@user3907335 is exactly correct here: HDP will calculate as many topics as the assigned truncation level. However, it may be the case that many of these topics have essentially zero probability of occurring. To help with this in my own work, I wrote a handy little function that performs a rough estimate of the probability weight associated with each topic. Note that this is a rough metric only: it does not account for the probability associated with each word. Even so, it provides a pretty good indication of which topics are meaningful and which aren't:
import pandas as pd
import numpy as np

def topic_prob_extractor(hdp=None, topn=None):
    topic_list = hdp.show_topics(topics=-1, topn=topn)
    topics = [int(x.split(':')[0].split(' ')[1]) for x in topic_list]
    split_list = [x.split(' ') for x in topic_list]
    weights = []
    for lst in split_list:
        sub_list = []
        for entry in lst:
            if '*' in entry:
                sub_list.append(float(entry.split('*')[0]))
        weights.append(np.asarray(sub_list))
    sums = [np.sum(x) for x in weights]
    return pd.DataFrame({'topic_id': topics, 'weight': sums})
I assume that you already know how to calculate an HDP model. Once you have an hdp model calculated by gensim you call the function as follows:
topic_weights = topic_prob_extractor(hdp, 500)
I think you misunderstood the operation performed by the called method. Directly from the documentation you can see:
Alias for show_topics() that prints the top n most probable words for topics number of topics to log. Set topics=-1 to print all topics.
You trained the model without specifying the truncation level on the number of topics, and the default one is 150. Calling print_topics with topics=-1, you'll get the top 20 words for each topic, in your case 150 topics.
I'm still a newbie with the library, so maybe I'm wrong.
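For what it's worth, the truncation level can also be set explicitly when the model is built; the value 50 below is just an illustration, not a recommendation from the thread:

# Sketch: cap the number of topics via the T (truncation level) parameter.
hdp = models.HdpModel(corpusA, id2word=dictionaryA, T=50)  # at most 50 topics instead of the default 150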
I haven't used gensim for HDPs, but is it possible that most of the topics in the smaller corpus have an extremely low probability of occurring? Can you try printing the topic probabilities? Maybe the length of the topics array doesn't necessarily mean that all those topics were actually found in the corpus.
Deriving the average coherence of HDP topics from their coherence at the individual text level is a way to order (and potentially truncate) them. The following function does just that:
def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
    """
    Orders topics based on their average coherence across the corpus

    Parameters
    ----------
        dirichlet_model : gensim.models.hdpmodel.HdpModel
        bow_corpus : list of lists (contains (id, freq) tuples)
        num_topics : int (default=10)
        num_keywords : int (default=10)

    Returns
    -------
        ordered_topics : list of lists containing topic tokens
    """
    shown_topics = dirichlet_model.show_topics(num_topics=150,  # return all topics
                                               num_words=num_keywords,
                                               formatted=False)
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0)  # cutoff probability to 0
    topics_per_response = [response for response in topic_corpus]
    flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]

    significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences]))  # those that appear
    topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus)
                      for topic_num in significant_topics]

    topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i: i[1])[::-1]]
    significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics]  # truncate if desired

    return ordered_topics
A version of this function that includes an output of the average coherences associated with the topics, for keyword (tag) generation for a corpus, can be found in this answer. A similar process for keywords for individual texts can further be found in this answer.
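A hypothetical usage example (my own sketch, assuming hdp and bow_corpus were built as elsewhere in this thread):

ordered = order_subset_by_coherence(hdp, bow_corpus, num_topics=10, num_keywords=10)
for i, topic_tokens in enumerate(ordered):
    print(i, topic_tokens)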
