LDA Mallet Multiprocessing Freezing - python

So I am trying to run LDA Mallet on a dataset. It takes in lemma tokens and a collection of texts, which is our dataset. The issue is that when we run it, a freeze_support message pops up and all of our earlier steps that have already run start running again. It says it's due to a new process being started before the current one has finished its bootstrapping phase. Not sure how to fix it. This is run on macOS. Code and output are below.
import os
import gensim
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary

def optimize_parameters(lemma_tokens, texts):
    os.environ['MALLET_HOME'] = '****/mallet-2.0.8'
    mallet_path = '****/mallet-2.0.8/bin/mallet'
    id2word = Dictionary(lemma_tokens)
    # Filter out very rare and very common tokens
    id2word.filter_extremes(no_below=2, no_above=.99)
    # Create a bag-of-words corpus
    corpus = [id2word.doc2bow(d) for d in lemma_tokens]
    model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus,
                                             num_topics=5, id2word=id2word, workers=4)
    coherencemodel = CoherenceModel(model=model, texts=lemma_tokens,
                                    dictionary=id2word, coherence='c_v')
    coherence = coherencemodel.get_coherence()
    return coherence
The "****" is the rest of the path that can't be shown due to privacy.
The error output:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
<10> LL/token: -6.83952
<20> LL/token: -6.70949

I figured it out. You have to put the entire script under
if __name__ == '__main__':
    # imports
    # code
Found the solution via an old Google Groups thread. Link:
https://groups.google.com/g/gensim/c/-gMNdkujR48/m/i4Dn1_bjBQAJ
To summarize what is happening: because multiprocessing on macOS spawns fresh interpreter processes, each worker re-imports the main module, so the other bits of top-level code run multiple times instead of the once they are supposed to, including the very call that starts the workers. The if __name__ == '__main__': guard distinguishes the main script (where __name__ is '__main__') from those re-imports in the workers: in the main process the guarded code runs, in the workers it is skipped, so everything executes exactly once.
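For anyone hitting this on macOS, here is a minimal sketch of the restructured script (the placeholder data is mine; the guard is the only structural change):

import gensim
from gensim.corpora import Dictionary

def optimize_parameters(lemma_tokens, texts):
    ...  # the function body from the question, unchanged

if __name__ == '__main__':
    # Only the main process enters this block. With the "spawn" start
    # method (the default on macOS), each worker re-imports this module,
    # so any unguarded top-level code would run again in every worker.
    lemma_tokens = [['some', 'tokens'], ['more', 'tokens']]  # placeholder data
    optimize_parameters(lemma_tokens, texts=lemma_tokens)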

Related

Why does a call to torch.tensor inside of apply_async fail to complete (seems to block execution)?

I'm trying to understand why the following simple example doesn't complete successfully and seems to get stuck on the first line of really_simple_func (on Ubuntu machines, but not on Windows). The code is:
import torch as t
import numpy as np
import multiprocessing as mp  # I've tried both multiprocessing
# import torch.multiprocessing as mp  # and torch.multiprocessing

def really_simple_func():
    temp_val_2 = t.tensor(np.zeros(425447)[0:400000])  # this is the line that blocks
    return 4.3

if __name__ == "__main__":
    print("Run brief starting")
    some_zeros = np.zeros(425447)
    temp_val = t.tensor(some_zeros[0:400000])  # DELETE THIS LINE TO MAKE IT WORK
    pool = mp.Pool(processes=1)
    job = pool.apply_async(really_simple_func)
    print("just before job.get()")
    result = job.get()
    print("Run brief completed. Reward = {}".format(result))
I have torch 1.11.0 and numpy 1.22.3 installed and have tried both the CPU and GPU versions of Torch. When I run this code on two different Ubuntu machines, I get the following output:
Run brief starting
just before job.get()
However, the code never successfully completes (doesn't print the "Run brief completed" line). (It does complete on a third Windows box).
On the Ubuntu machines, if I delete the line with the comment "#DELETE THIS LINE TO MAKE IT WORK" the execution DOES complete, printing the final line as expected. Similarly, if I leave the line defining temp_val in but delete the line with the comment "This is the line that blocks" it will also complete. Moreover, if I reduce the size of the temp_val tensor (say from 400000 to 4000) it will also complete successfully. Finally, it is worth noting that while I can reproduce this behaviour on two different Ubuntu machines, this code does actually complete on my Windows machine - though, as far as I can tell, the versions of key packages, such as torch, are the same.
I don't understand this behaviour. I suspect it is something to do with the way torch allocates memory or stores information. I've tried calling del temp_val to free up memory, but that doesn't seem to fix things. It seems to me that the async call to t.tensor within really_simple_func is stopped from completing if there has already been a call to t.tensor in the main code block, creating a sufficiently large tensor.
I don't understand why this is happening, or even if that is the correct explanation. In any case, what would be best practice if I do need to do some tensor processing within apply_async as well as in the main thread? More generally, what is Torch waiting on when I make a call to t.tensor?
(Obviously, this is just the simplest version of the real code I'm trying to get to work that reproduced this issue. I realise that calling mp.Pool with only one process doesn't really make sense...nor, indeed, does using apply_async to call a function that returns a constant!)
Unfortunately, I cannot provide any answers to your questions.
I can, however, share experiences with seemingly the same issue. I use a Linux machine with torch 1.8.1 and numpy 1.19.2.
When I run the following code on my machine:
with Pool(max_pool) as p:
    pool_outputs = list(
        tqdm(
            p.imap(lambda f: get_model_results_per_query_file(get_preds, tokenizer, f),
                   query_files),
            total=len(query_files)
        )
    )
The function get_model_results_per_query_file contains operations similar to the following:
feats = features.unsqueeze(0).repeat(batch_size, 1, 1).to(device)
(features is a torch tensor)
The first round of jobs all fail immediately, and new ones are started straight away (which, for some reason, do not fail). The whole process never completes, though, since the main process still seems to be waiting for the first, failed jobs.
If I remove the lines in my code involving the repeat function, no jobs fail.
I managed to solve my issue and preserve the same results by adapting a solution similar to yours:
feats = torch.as_tensor(np.tile(features, (batch_size, 1, 1))).to(device)
I believe as_tensor works in a similar fashion to from_numpy in this case.
I only managed to find this solution thanks to your post and your proposed workaround, so thank you!
After some further exploration, here is a brief answer to my own question.
While I still don't fully understand the blocking behaviour (and would welcome any further explanation), I have just seen that the way I'm generating torch tensors from a numpy array is not correct.
In particular, instead of using torch.tensor(temp_val) where temp_val is a numpy array, I should be using torch.from_numpy(temp_val). Doing this fixes the problem.
Alternatively, I can convert temp_val into a list and then create the tensor via torch.tensor(temp_val_as_list) - which also avoids the issue.
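For completeness, a sketch of the example from the question with that fix applied (only the tensor construction is changed):

import torch as t
import numpy as np
import multiprocessing as mp

def really_simple_func():
    # from_numpy wraps the existing array instead of copying it and,
    # per the workaround above, no longer blocks inside the worker
    temp_val_2 = t.from_numpy(np.zeros(425447)[0:400000])
    return 4.3

if __name__ == "__main__":
    some_zeros = np.zeros(425447)
    temp_val = t.from_numpy(some_zeros[0:400000])
    pool = mp.Pool(processes=1)
    job = pool.apply_async(really_simple_func)
    print("Run brief completed. Reward = {}".format(job.get()))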

Concurrent.futures.map initializes code from beginning

I am a fairly beginner programmer with Python and don't have much experience in general; currently I'm trying to parallelize a heavily CPU-bound part of my code. I'm using Anaconda to create environments and Visual Studio Code to debug.
A summary of the code is as following :
from tkinter import filedialog
import concurrent.futures
import myfuncs as mf

file_path = filedialog.askopenfilename(title='Ask for a file containing data')
# import data from file_path
a = input('Ask the user for input')
Calculations are then made from these values, and I reach a stage where I need to iterate over a list of lists. These lists may contain up to two values, and calls are made to a separate file. For example, the inputs are:
sub_data1 = [test1]
sub_data2 = [test1, test2]
dataset = [sub_data1, sub_data2]
This is the stage where I use a concurrent.futures.ProcessPoolExecutor() instance and its .map() method:
with concurrent.futures.ProcessPoolExecutor() as executor:
    sm_res = executor.map(mf.process_distr, dataset)
Inside myfuncs.py, the mf.process_distr() function works like this:
def process_distr(tests):
    sm_reg = []
    for i in range(len(tests)):
        if i == 0:
            # do stuff
            sm_reg.append(result1)
        else:
            # do stuff
            sm_reg.append(result2)
    return sm_reg
The problem is that when I execute this code from main.py, main.py seems to start running multiple times: the user-input prompts and the file dialog pop up repeatedly (as many times as there are cores). How can I resolve this?
Edit: After reading more into it, encapsulating the whole main.py code in
if __name__ == '__main__':
did the trick, as sketched below. Thank you to anyone who gave their time to help with my rookie problem.
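For later readers, a sketch of what that restructuring looks like (module and function names as in the question; the dataset here is a placeholder):

from tkinter import filedialog
import concurrent.futures
import myfuncs as mf

def main():
    file_path = filedialog.askopenfilename(title='Ask for a file containing data')
    # ... import data from file_path and build dataset ...
    dataset = [['test1'], ['test1', 'test2']]  # placeholder
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return list(executor.map(mf.process_distr, dataset))

if __name__ == '__main__':
    # worker processes re-import main.py, but the dialog and input()
    # calls now run only in the main process
    sm_res = main()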

tsfresh extract_features runtime error "freeze_support"

I recently installed the tsfresh package to extract features from my time series data. I tried to run the example in the documentation and got the following error:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
I'm a bit confused since it's literally the example code:
from tsfresh.examples import load_robot_execution_failures
from tsfresh import extract_features
df, _ = load_robot_execution_failures()
X = extract_features(df, column_id='id', column_sort='time')
I get the same error when I try the function with my own data.
What am I doing wrong?
I think there is a solution to the problem. You can try putting if __name__ == '__main__': around the function calls.
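Applied to the example code above, that would look like this (a sketch; the guard is the only change):

from tsfresh.examples import load_robot_execution_failures
from tsfresh import extract_features

if __name__ == '__main__':
    # extract_features parallelizes internally with multiprocessing;
    # without the guard, each spawned worker re-runs this top-level code
    df, _ = load_robot_execution_failures()
    X = extract_features(df, column_id='id', column_sort='time')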

How to store the last displayed value from the console?

My Python script passes changing inputs to a program called "Dymola", which in turn performs a simulation to generate outputs. Those outputs are stored as numpy arrays ("out1.npy").
for i in range(0, 100):
    # code to initiate simulation
    print(startValues, 'ParameterSet:', ParameterSet, 'time:', stoptime)
    np.save('out1.npy', output_data)
Unfortunately, Dymola crashes very often, which makes it necessary to rerun the loop from the iteration displayed in the console at the time of the crash (e.g. 50) and to increase the number of the output file by 1; otherwise the data from the first run would be overwritten.
for i in range(50, 100):
    # code to initiate simulation
    print(startValues, 'ParameterSet:', ParameterSet, 'time:', stoptime)
    np.save('out2.npy', output_data)
Is there any way to read the 'stoptime' value (e.g. 50) from the console after Dymola has crashed?
I'm assuming Dymola is a third-party program that you cannot change.
One possibility is to use the subprocess module to start Dymola and read its output from your program, either line by line as it runs or all at once after the created process exits. You also have access to Dymola's exit status.
If it's a Windows-y thing that doesn't do stream output but manipulates a GUI-style window, and if it doesn't produce a useful exit status code, your best bet might be to look at what files it has created while running or after it has gone; sorted(glob.glob("somepath/*.out")) may be useful.
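A rough sketch of the subprocess approach (the executable name, its argument, and the 'time:' marker are placeholders for whatever Dymola actually accepts and prints):

import subprocess

stoptime = None
proc = subprocess.Popen(['dymola', 'myModel.mo'],  # placeholder command
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    print(line, end='')
    if 'time:' in line:  # parse the value the loop prints each iteration
        stoptime = line.rsplit('time:', 1)[1].strip()
proc.wait()
print('exit status:', proc.returncode, 'last time seen:', stoptime)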
I assume you're using the Dymola Python interface to simulate your model. If so, why not use the return value of the dymola.simulate() function and check it for errors?
E.g.:
from dymola.dymola_interface import DymolaInterface

dymola = DymolaInterface()
crash_counter = 1
for i in range(0, 100):
    res = dymola.simulate("myModel")
    if not res:
        crash_counter += 1
    print(startValues, 'ParameterSet:', ParameterSet, 'time:', stoptime)
    np.save('out%d.npy' % crash_counter, output_data)
As it is sometimes difficult to install the DymolaInterface on your machine, here is a useful link.
Taken from there:
The Dymola Python Interface comes in the form of a few modules at \Dymola 2018\Modelica\Library\python_interface. The modules are bundled within the dymola.egg file.
To install:
The recommended way to use the package is to append the \Dymola 2018\Modelica\Library\python_interface\dymola.egg file to your PYTHONPATH environment variable. You can do so from the Windows command line via set PYTHONPATH=%PYTHONPATH%;D:\Program Files (x86)\Dymola 2018\Modelica\Library\python_interface\dymola.egg.
If this does not work, append the following code before instantiating the interface:
import os
import sys

sys.path.insert(0, os.path.join('PATHTODYMOLA',
                                'Modelica',
                                'Library',
                                'python_interface',
                                'dymola.egg'))

LDA model and joblib parallel computing error

I have some code that runs an LDA model over a bunch of CSV lines.
# assuming scikit-learn's implementation, given the parameter names
from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(
    n_components=20,           # number of topics
    max_iter=10,               # max learning iterations
    learning_method='online',
    random_state=100,          # random state
    batch_size=128,            # n docs in each learning iter
    evaluate_every=-1,         # compute perplexity every n iters, default: don't
    n_jobs=-1,                 # use all available CPUs
)

if __name__ == "__main__":
    lda_output = lda_model.fit_transform(data_vectorized)
    print(lda_model)
I ran it the first time without the if __name__ line and had no issues. The second time I ran it, I got this error:
ImportError: [joblib] Attempting to do parallel computing without
protecting your import on a system that does not support forking. To
use parallel-computing in a script, you must protect your main loop
using "if __name__ == '__main__'". Please see the joblib documentation
on Parallel for more information.
So I tried to add the if __name__ code to get it to work, but I'm still having issues. I've tried inserting it in various spots of my code (I'm on Windows) and nothing works. Do I need to add something else?
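One thing worth checking, sketched below under assumptions (the data loading is a placeholder, and this presumes scikit-learn as above): on Windows there is no fork, so every top-level statement runs again in each joblib worker, which means everything except imports and definitions has to sit under the guard, not just the fit_transform call.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def run_lda(docs):
    # vectorize and fit entirely inside a function called from the guard,
    # so re-imports in the workers execute no model-building code
    data_vectorized = CountVectorizer().fit_transform(docs)
    lda_model = LatentDirichletAllocation(
        n_components=20, max_iter=10, learning_method='online',
        random_state=100, batch_size=128, evaluate_every=-1, n_jobs=-1)
    return lda_model.fit_transform(data_vectorized)

if __name__ == '__main__':
    docs = ['some text here', 'more text here', 'yet more text']  # placeholder for the CSV lines
    lda_output = run_lda(docs)
    print(lda_output.shape)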
