How to have multiple MLFlow runs in parallel? - python

I'm not very familiar with parallelization in Python and I'm getting an error when trying to train a model on multiple training folds in parallel. Here's a simplified version of my code:
from multiprocessing.pool import ThreadPool

import mlflow

def train_test_model(fold):
    # here I train the model etc...
    # now I want to save the parameters and metrics
    with mlflow.start_run():
        mlflow.log_param("run_name", run_name)
        mlflow.log_param("modeltype", modeltype)
        # and so on...

if __name__ == "__main__":
    pool = ThreadPool(processes=num_trials)
    # run folds in parallel
    pool.map(lambda fold: train_test_model(fold), folds)
I'm getting the following error:
Exception: Run with UUID 23e9bb6d22674a518e48af9c51252860 is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True
The documentation says that mlflow.start_run() starts a new run and makes it active, which is the root of my problem. Every thread starts an MLflow run for its corresponding fold and makes it active, while I need the runs to execute in parallel, i.e. all be active(?), each saving the parameters/metrics of its own fold. How can I solve this issue?

I found a solution, maybe it will be useful for someone else. You can see details with code examples here: https://github.com/mlflow/mlflow/issues/3592
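For anyone who lands here before following the link, a minimal sketch of one way around the shared active-run state (not necessarily the exact code from the linked issue) is to use MlflowClient, which logs against explicit run IDs instead of the thread-global active run; the folds and experiment ID below are placeholders:

from multiprocessing.pool import ThreadPool

from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment_id = "0"  # assumption: log to the default experiment

def train_test_model(fold):
    # ... train the model for this fold ...
    run = client.create_run(experiment_id)  # every thread gets its own run
    client.log_param(run.info.run_id, "fold", fold)
    client.log_metric(run.info.run_id, "accuracy", 0.9)  # placeholder value
    client.set_terminated(run.info.run_id)

if __name__ == "__main__":
    folds = [0, 1, 2, 3]  # placeholder folds
    pool = ThreadPool(processes=len(folds))
    pool.map(train_test_model, folds)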

Related

Jupyter Notebook GPU memory release after training model

How can we clear GPU memory after finishing the training of a deep learning model in a Jupyter notebook? The problem is that, no matter which framework I stick to (TensorFlow, PyTorch), the memory allocated on the GPU does not get released unless I kill the process manually or kill the kernel and restart Jupyter. Do you have any idea how we can get rid of this problem by automating these steps?
The only workaround I found was to execute the training in a separate process.
An example:
from multiprocessing import Process

def Training(arguments):
    ....
    ....
    return model

if __name__ == '__main__':
    # Wrap the complete training function defined above in a subprocess.
    # Note the trailing comma after the arguments: it is needed for Python
    # to understand that args is a tuple.
    Subprocess = Process(target=Training, args=(arguments, ))
    # Starting the defined subprocess
    Subprocess.start()
    # Wait for the subprocess to complete; the GPU memory it holds is
    # released when the process exits
    Subprocess.join()
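One caveat, since it is easy to miss: return model inside a Process target does not hand the model back to the parent process. A minimal sketch of one workaround (the names and values here are illustrative) is to persist the model to disk inside the subprocess and push lightweight results back through a multiprocessing.Queue:

from multiprocessing import Process, Queue

def Training(arguments, result_queue):
    # ... build and train the model here; its GPU memory is freed
    # when this process exits ...
    # model.save("model.h5")  # persist the model instead of returning it
    final_accuracy = 0.9      # placeholder result
    result_queue.put(final_accuracy)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=Training, args=("some-arguments", queue))
    p.start()
    result = queue.get()  # blocks until the subprocess sends its result
    p.join()
    print(result)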

How to loop through different config files one at a time?

I want to train a ML model with different configurations. I have a few config files in the config folder and hope to keep testing them one after another so that I don't have to wait for each training to finish and manually run train.py each time.
I thought I could just use a for loop like this:
from train import train_on_config

configs = ['config1.cfg', 'config2.cfg', 'config3.cfg']
for config in configs:
    train_on_config(config)
Since each run of train_on_config(config) takes hours, my concern is whether the for loop will wait for each run to finish, or whether it will iterate through configs right away and start all three runs of train_on_config at the same time, which in my case I would like to avoid.
I am aware of cron on Linux, but that seems only to schedule recurring jobs, not runs one after another with different configs...
Overall, I just want to make sure the for loop runs train_on_config(config) one at a time.
Code is executed sequentially. Each call will execute after the previous one finishes.
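If you want the queue to keep going even when one configuration fails, a minimal sketch along the same lines (the config file names are just the ones from the question) makes the blocking, one-at-a-time behaviour explicit:

from train import train_on_config

configs = ['config1.cfg', 'config2.cfg', 'config3.cfg']

for config in configs:
    try:
        # This call blocks: the next config only starts after it returns.
        train_on_config(config)
    except Exception as exc:
        # One failed config should not stop the remaining runs.
        print(f"{config} failed: {exc}")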

PyQt5 run sklearn calculations on a separate QThread

I am creating a PyQt5 application and want to run some ML code in a separate thread when a button is clicked. I am unable to do so. This is my threading code:
newthread = QtCore.QThread()
runthread = RunThread()
runthread.moveToThread(newthread)
runthread.finished.connect(newthread.quit)
newthread.started.connect(runthread.long_run)
newthread.start()
And the class RunThread is as follows:
class RunThread(QtCore.QObject):
    finished = QtCore.pyqtSignal()

    def __init__(self):
        super().__init__()

    @QtCore.pyqtSlot()
    def long_run(self):
        # ML code called in the line below
        simulation.start_simulation()
        self.finished.emit()
Running it normally doesn't work; PyCharm quits with the following error:
process finished with exit code -1073740791 (0xc0000409)
Running it in debug mode throws thousands of warnings from sklearn saying:
Multiprocessing-backed parallel loops cannot be nested below threads,
setting n_jobs=1
The code runs eventually, but it takes a lot more time (at least 4x times) to do so.
Could someone please let me know what the issue here is?
The warning from sklearn is pretty explicit here. Basically you cannot train sklearn models with hyperparameter n_jobs set to anything greater than 1 when executing from a nested thread.
That being said, without seeing what is going on inside of simulation.start_simulation() it's hard to conjecture.
My best guess would be to look for anything in start_simulation() that is using multiple jobs and see what happens when you set that to n_jobs=1.
Alternatively, you may try writing your ML code as a standalone script and execute it using QProcess. That may allow you to utilize n_jobs greater than 1.
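A minimal sketch of the QProcess variant, assuming the ML code has been moved into a standalone script (run_simulation.py here is a placeholder name); because the work runs in a separate process rather than a nested thread, sklearn's multiprocessing backend and n_jobs greater than 1 are no longer restricted:

from PyQt5 import QtCore

class SimulationRunner(QtCore.QObject):
    finished = QtCore.pyqtSignal(int)

    def __init__(self, parent=None):
        super().__init__(parent)
        self.process = QtCore.QProcess(self)
        # QProcess.finished carries (exitCode, exitStatus)
        self.process.finished.connect(self._on_finished)

    def start(self):
        # Launch the ML code in its own process (placeholder script name)
        self.process.start("python", ["run_simulation.py"])

    def _on_finished(self, exit_code, exit_status):
        self.finished.emit(exit_code)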

Queue manager for experiments in neural network training

I am conducting experiments with neural networks using Keras + TensorFlow backend. I do this using GPU on my PC, running Windows 7.
My workflow looks like the following.
I create a small Python script that defines a model and then runs model.fit_generator for ~50 epochs, with early stopping if validation accuracy does not improve for 10-15 epochs. Then I run it from a terminal with a command like python model_v3_4_5.py.
Usually one epoch takes about 1.5 hours. During this period some new ideas (training parameters or new architecture) come into my head.
Then I create a new python script...
During experiments I've found that it is better not to train several models in parallel: I've experienced a doubling of epoch time and a strange decrease in validation accuracy.
Therefore, I'd like to wait until the first training finishes and then run the second one. At the same time, I'd like to avoid my PC sitting idle, and start a new training immediately after the first one has finished.
But I don't know exactly when the first training finishes, therefore, running commands like timeout <50 hours> && python model_v3_4_6.py would be a dumb solution.
Then I need some kind of a queue manager.
One solution that has come to my mind is installing a Jenkins slave on my PC and using the queues that Jenkins provides. As far as I remember, Jenkins has issues with GPU access.
Another variant is training models in a Jupyter notebook in separate cells. However, I cannot see a queue of cell execution here, and this is a topic still being discussed.
Update. Another variant: add some code to the model scripts that retrieves the current GPU state (is it running a NN at the moment?) and waits if it is busy. This will produce issues when several scripts (more than one bright new idea :) ) are waiting for the GPU to become idle.
Are there any other variants?
Finally, I've come up with this simple cmd script:
set PYTHONPATH=%CD%
:start
for %%m in (train_queue\model*.py) do (
    python %%m
    del %%m
)
timeout 5
goto start
One creates a subdirectory train_queue and puts the model scripts in it. All scripts log their output to files whose names contain timestamps.
The script also calls the timeout program between passes over the queue.
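For anyone who prefers a cross-platform version, a rough Python equivalent of the same queue runner (same train_queue directory and model*.py pattern) could look like this:

import glob
import os
import subprocess
import time

QUEUE_DIR = "train_queue"

while True:
    for script in sorted(glob.glob(os.path.join(QUEUE_DIR, "model*.py"))):
        # Run one training at a time; this call blocks until it finishes
        subprocess.run(["python", script])
        os.remove(script)
    # Pause between passes, like the timeout call in the cmd script
    time.sleep(5)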

Is Spark's KMeans unable to handle bigdata?

KMeans has several parameters for its training, with the initialization mode defaulting to kmeans||. The problem is that it marches quickly (less than 10 min) through the first 13 stages, but then hangs completely, without yielding an error!
Minimal Example which reproduces the issue (it will succeed if I use 1000 points or random initialization):
from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName='kmeansMinimalExample')
    # same with 10000 points
    data = RandomRDDs.uniformVectorRDD(sc, 10000000, 64)
    C = KMeans.train(data, 8192, maxIterations=10)
    sc.stop()
The job does nothing (it doesn't succeed, fail, or progress). There are no active/failed tasks in the Executors tab, and the stdout and stderr logs don't show anything particularly interesting.
If I use k=81 instead of 8192, it will succeed.
Notice that the two calls of takeSample() should not be an issue, since they are called twice in the random-initialization case as well.
So, what is happening? Is Spark's Kmeans unable to scale? Does anybody know? Can you reproduce?
If it was a memory issue, I would get warnings and errors, as I had been before.
Note: placeybordeaux's comments are based on the execution of the job in client mode, where the driver's configuration is invalidated, causing exit code 143 and such (see the edit history), not in cluster mode, where no error is reported at all and the application just hangs.
From zero323: "Why is Spark MLlib KMeans algorithm extremely slow?" is related, but I think he sees some progress while mine hangs; I did leave a comment...
I think the 'hanging' is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in Pyspark and Scala. However, it takes a lot longer than it should. It is almost all time spent in k-means|| initialization.
I opened https://issues.apache.org/jira/browse/SPARK-17389 to track two main improvements, one of which you can use now. Edit: really, see also https://issues.apache.org/jira/browse/SPARK-11560
First, there are some code optimizations that would speed up the init by about 13%.
However, most of the issue is that it defaults to 5 steps of k-means|| init, when it seems that 2 is almost always just as good. You can set the initialization steps to 2 to see a speedup, especially in the stage that's hanging now.
In my (smaller) test on my laptop, init time went from 5:54 to 1:41 with both changes, mostly due to setting init steps.
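If I read the pyspark.mllib API correctly, the number of k-means|| steps can be passed directly via the initializationSteps argument of KMeans.train, so a sketch of the suggested change applied to the minimal example above would be:

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName='kmeansInitStepsExample')
    data = RandomRDDs.uniformVectorRDD(sc, 10000000, 64)
    # 2 k-means|| initialization steps instead of the default 5
    model = KMeans.train(data, 8192, maxIterations=10, initializationSteps=2)
    sc.stop()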
If your RDD is that large, collectAsMap will attempt to copy every single element of the RDD onto the single driver program, then run out of memory and crash. Even though you have partitioned the data, collectAsMap sends everything to the driver and your job crashes.
You can make sure the number of elements you return is capped by calling take or takeSample, or perhaps by filtering or sampling your RDD.
Similarly, be cautious of these other actions as well unless you are sure your dataset size is small enough to fit in memory:
countByKey,
countByValue,
collect
If you really do need every value of the RDD and the data is too big to fit in memory, you could write the RDD out to files or export it to a database large enough to hold all the data. As you are using an API, I think you are not able to do that (rewrite all the code, maybe? increase memory?). I think this collectAsMap in the runAlgorithm method is a really bad thing in KMeans (https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html)...
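To illustrate the general point about capping what is brought back to the driver (generic advice against the data RDD from the minimal example above, not a fix for the collectAsMap call inside MLlib's runAlgorithm):

# Pulling the whole RDD onto the driver can OOM it:
# everything = data.collect()

# Cap what comes back instead:
preview = data.take(10)                   # first 10 elements only
sample = data.takeSample(False, 1000)     # 1000 random elements, no replacement

# Or shrink the RDD before any collect-style action:
small_rdd = data.sample(False, 0.001)     # keep roughly 0.1% of the rows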
