I am working on an optimization algorithm that I need to run at least 100 times to evaluate its performance. The script is essentially one big loop, running the same block of code over and over. The problem is that the whole thing takes up to 10 hours even on a small dataset.
Is it possible to run this on a platform so that I can decrease this time? Can I run it faster on the cloud?
I suggest you "divide and conquer" your algorithm.
First, take a small sample of the data so you can explore the code without losing too much time. Once you have a manageable piece of code, apply a profiling tool to see where the time is spent.
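For example, a minimal sketch of that approach (run_algorithm and data are placeholders for your own code and dataset):

    import cProfile
    import pstats

    data = list(range(1_000_000))            # placeholder for your real dataset
    small_sample = data[:1000]               # a small slice to keep the profiled run short

    def run_algorithm(sample):               # placeholder for your optimization loop
        return sum(x * x for x in sample)

    cProfile.run("run_algorithm(small_sample)", "profile.out")
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(20)   # show the 20 most expensive call sites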
line_profiler is very handy for analyzing code performance line by line. I have a Python function that is called N times and I want to measure its time performance using line_profiler. The problem is that Python takes extra time to compile the function during the first loop iteration, and in my case I am not interested in including that compilation time in the analysis. Is there a way to ignore the compilation overhead when profiling the code in Python?
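One common workaround, sketched below under the assumption that the overhead is a one-off cost on the first call (for example a JIT compile): call the function once as a warm-up before wrapping it in the profiler, so only the later calls are measured. my_func is a placeholder.

    from line_profiler import LineProfiler

    def my_func(x):
        return x * x                 # placeholder for the real function

    my_func(1)                       # warm-up call: the one-off compile cost is paid here

    lp = LineProfiler()
    profiled = lp(my_func)           # wrap the function so calls to it are timed line by line
    for i in range(1000):            # the N calls you actually want to measure
        profiled(i)
    lp.print_stats()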
I am calling a function and saving its two outputs to variables, but this process takes time because the outputs are generated by solving an ODE.
Is it possible to use multiple cores to run the function faster so the values are saved sooner? If so, could someone provide a simple example?
Thank you
Simply running the same code on multiple cores will not make it run faster. It really depends on the type of tasks you're doing. Here are some questions you need to answer before you can decide whether the code will benefit from parallel processing:
Are the steps in your computation sequence-dependent? In other words, does one part of the code depend on calculations done in a previous part, or can some of them be computed in parallel? Look at Amdahl's law to learn how much speedup to expect based on how much of your code you can parallelize: with a parallelizable fraction p and n workers, the best-case speedup is 1 / ((1 - p) + p/n).
Does your code involve lots of reading/writing to disk and memory, or is it mostly computation? If you are doing significant reads and writes to disk, then creating multiple processes to do other work while some threads wait on disk can result in significant speedups. But again, this depends on your answer to the previous point about sequence dependency.
How long does your code currently take to run, and is the overhead of creating multiple processes going to be more than the time it takes to run sequentially? You don't give specific times in your question: if you're talking about speeding up a task that takes a few seconds, the time required to create multiple processes might be significant compared to the total task; if the task takes minutes, the overhead won't be as significant.
Have you considered whether your code is data-parallel or task-parallel? If so, you can decide whether to parallelize on the CPU or the GPU. For large mathematical operations, look at NumPy for CPU-based and CuPy for GPU-based operations (a small sketch of the data-parallel case follows this list).
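As a concrete illustration of the data-parallel case, here is a minimal sketch assuming the expensive function has to be evaluated for several independent inputs (a single sequential ODE solve cannot be split up this way); solve_ode and the parameter values are placeholders:

    from multiprocessing import Pool

    def solve_ode(k):
        # placeholder for the real solver; returns the two outputs you mention
        out1 = 2.0 * k
        out2 = k ** 2
        return out1, out2

    if __name__ == "__main__":
        parameter_sets = [0.1, 0.5, 1.0, 2.0]        # independent inputs
        with Pool() as pool:                          # one worker process per CPU core by default
            results = pool.map(solve_ode, parameter_sets)
        # results is a list of (out1, out2) tuples, one per parameter set
        print(results)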
I am currently working on an ML NLP project and I want to measure the execution time of certain parts and, potentially, predict how long the execution will take. For example, I want to measure the ML training process (including sub-processes like the data preprocessing part). I have been looking online and I have come across different Python modules that can measure the execution time of functions (like time or timeit). However, I still haven't found a concrete solution for predicting the time a function will take to execute. I have thought about running the code several times, saving the (data_size, time) values, and then using those to extrapolate for future data. I also thought about updating this estimate with the time it took to run several subparts of a function (e.g. seeing how much of the process has been computed and how long it took, and using that to adjust the time left).
However, I am not sure about any of this, and I wanted to see whether there are better options out there that I'm not aware of. If anyone has a better idea, I'd be thankful if you could share it.
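For illustration, the extrapolation idea could be prototyped roughly like this (a sketch only; the (data_size, time) measurements below are made up):

    import numpy as np

    # placeholder measurements collected from past runs: (data_size, seconds)
    sizes = np.array([1_000, 5_000, 10_000, 50_000])
    times = np.array([2.1, 11.0, 23.5, 118.0])

    # fit a simple model (here: linear in the data size) and extrapolate
    coeffs = np.polyfit(sizes, times, deg=1)
    predict = np.poly1d(coeffs)

    print(predict(200_000))   # rough estimate, in seconds, for a future larger run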
Have you looked into profiling? It should give a detailed breakdown of function execution times, the number of calls, etc. You will have to run the script under the profiler, and then you will get the detailed breakdown.
https://docs.python.org/3/library/profile.html#module-cProfile
If you want in-time progress reports there are a couple of libraries I've seen. https://pypi.org/project/tqdm/
https://pypi.org/project/progressbar2/
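For example, a minimal tqdm usage (the loop body is a placeholder for a training or preprocessing step):

    from tqdm import tqdm
    import time

    for batch in tqdm(range(100), desc="training"):
        time.sleep(0.01)   # placeholder for the real work; tqdm prints the rate and ETA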
Hope these help!
I'm doing some Monte Carlo runs for a model and figured that Dask could be quite useful for this purpose. For the first 35 hours or so, things were running quite "smoothly" (apart from the fan noise giving the sense that the computer was taking off). Each model run would take about 2 seconds and there were 8 partitions running in parallel; Activity Monitor was showing 8 python3.6 instances.
However, the computer has gone "silent" and CPU usage (as displayed in Spyder) hardly exceeds 20%. Model runs are now happening sequentially (not in parallel) and taking about 4 seconds each. This happened today at some point while I was working on other things. I understand that, depending on the sequence of actions, Dask won't use all cores at the same time; however, in this case there is really just one task to be performed (see further below), so one would expect all partitions to run and finish more or less simultaneously. Edit: the whole setup has run successfully for 10,000 simulations in the past; the difference now is that there are nearly 500,000 simulations to run.
Edit 2: it has now shifted to processing 2 partitions in parallel (instead of the previous 1 and the original 8). Something appears to be changing how many partitions are processed simultaneously.
Edit 3: Following the recommendations, I used a dask.distributed.Client to track what is happening and ran it for the first 400 rows. An illustration of what it looks like after completion is included below. I am struggling to understand the x-axis labels; hovering over the rectangles shows about 143 s.
Some questions therefore are:
Is there any relationship between running other software (Chrome, MS Word) and having the computer "take back" some CPU from python?
Or instead, could it be related to the fact that at some point I ran a second Spyder instance?
Or even, could the computer have somehow run out of memory? But then wouldn't the command have stopped running?
... any other possible explanation?
Is it possible to "tell" Dask to keep up the hard work and go back to using all CPU power while it is still running the original command?
Is it possible to interrupt an execution and keep whichever calculations have already been performed? I have noticed that stopping the current command doesn't seem to do much.
Is it possible to inquire about the overall progress of the computation while it is running? I would like to know how many model runs are left, to get an idea of how long it would take to complete at this slow pace. I have tried using the ProgressBar in the past, but it hangs at 0% until a few seconds before the end of the computation.
To be clear, uploading the model and the necessary data would be very complex. I haven't created a reproducible example either, out of fear of making the issue worse (for now the model is still running, at least...), and because, as you can probably tell by now, I have very little idea of what could be causing it, so I am not expecting anyone to be able to reproduce it. I'm aware this is not best practice and I apologise in advance. However, I would really appreciate some thoughts on what could be going on and possible ways to address it, if anyone has been through something similar before and/or has experience with Dask.
Running:
- macOS 10.13.6 (Memory: 16 GB | Processor: 2.5 GHz Intel Core i7 | 4 cores)
- Spyder 3.3.1
- dask 0.19.2
- pandas 0.23.4
Please let me know if anything needs to be made clearer
If you believe it can be relevant, the main idea of the script is as follows (a rough code sketch is included after the outline):
# Create a pandas DataFrame where each column is a parameter and each row is a possible parameter combination (cartesian product). At the end of each row, some columns are also pre-allocated to store the values of the objective functions.
# Generate a dask dataframe that is the DataFrame above split into 8 partitions
# Define a function that takes a partition and, for each row:
# Runs the model with the coefficient values defined in the row
# Retrieves the values of objective functions
# Assigns these values to the respective columns of the current row in the partition (columns have been pre-allocated)
# and then returns the partition with columns for objective functions populated with the calculated values
# Apply this function to the dask dataframe with map_partitions()
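A rough sketch of that outline (hypothetical parameter names, and a stub run_model standing in for the real model):

    import itertools
    import pandas as pd
    import dask.dataframe as dd

    def run_model(a, b):
        return a * b                          # stub standing in for the real model run

    # 1. cartesian product of parameter values, plus a pre-allocated result column
    params = {"a": [1, 2, 3], "b": [0.1, 0.2]}
    rows = list(itertools.product(*params.values()))
    df = pd.DataFrame(rows, columns=list(params.keys()))
    df["objective"] = float("nan")

    # 2. split the frame into partitions
    ddf = dd.from_pandas(df, npartitions=8)

    # 3. per-partition function: run the model row by row and fill in the result column
    def process_partition(part):
        for idx, row in part.iterrows():
            part.loc[idx, "objective"] = run_model(row["a"], row["b"])
        return part

    # 4. apply it to every partition and materialise the full result
    result = ddf.map_partitions(process_partition, meta=df.head(0)).compute()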
Any thoughts?
(Screenshots in the original post: one showing how simple the script is, and one of the Dask dashboard.)
Update: The approach I took was to:
Set a large number of partitions (npartitions=nCores*200). This made it much easier to visualise the progress. I'm not sure if setting so many partitions is good practice but it worked without much of a slowdown.
Instead of trying to get a single huge pandas DataFrame at the end with .compute(), I had the dask dataframe written to Parquet, so that each partition was written to a separate file. Later, reading all the files into a dask dataframe and compute()ing it into a pandas DataFrame wasn't difficult, and if something went wrong in the middle, at least I wouldn't lose the partitions that had already been processed and written (a minimal sketch of this is below).
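In simplified form, that Parquet round trip looks something like this (placeholder data and paths; pyarrow or fastparquet needs to be installed):

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({"x": range(8), "objective": [0.0] * 8})   # placeholder data
    ddf = dd.from_pandas(df, npartitions=4)

    # write each partition to its own Parquet file as soon as it has been processed
    ddf.to_parquet("results/")

    # later (or after an interruption), read back whatever was written and materialise it
    final_df = dd.read_parquet("results/").compute()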
(Screenshot in the original post: the dashboard at a given point during the run.)
Dask has many diagnostic tools to help you understand what is going on inside your computation. See http://docs.dask.org/en/latest/understanding-performance.html
In particular I recommend using the distributed scheduler locally and watching the Dask dashboard to get a sense of what is going on in your computation. See http://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
This is a web page you can visit that will tell you exactly what is going on across all of your processors.
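A minimal sketch of that workflow on a local machine; process_partition and the data are placeholders standing in for your own per-partition function and parameter table:

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client, progress

    def process_partition(part):
        part["y"] = part["x"] * 2        # placeholder per-partition work
        return part

    if __name__ == "__main__":
        client = Client()                # local cluster: one worker process per core;
                                         # the dashboard is served at http://localhost:8787 by default

        df = pd.DataFrame({"x": range(1000)})          # placeholder data
        ddf = dd.from_pandas(df, npartitions=8)

        result = ddf.map_partitions(process_partition).persist()  # start the work in the background
        progress(result)                                          # console progress bar while it runs
        final_df = result.compute()                               # gather the finished result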
I wrote a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model 1 for doing machine translation (here is my GitHub if you want to look at the code), and it works, but reeeaaaaallly sloowwwlly. I'm taking a class in parallel computing now and I was wondering if I could use Python multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as line 82) that performs a numerical computation one element at a time, you pay for that one computation plus all the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with NumPy. Unlike your normal Python code, NumPy calls out to precompiled, efficient binary code. The more work you can push into NumPy, the less time you will waste in the interpreter.
Once you have vectorized your code, you can start profiling it if it is still too slow. You should be able to find plenty of simple examples of how to vectorize Python, along with some of the alternative options.
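For instance, a small illustration of the difference (not taken from your code):

    import numpy as np

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)

    # element-by-element loop: every iteration pays the full interpreter overhead
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]

    # vectorised equivalent: one call into precompiled NumPy code,
    # typically orders of magnitude faster
    total_vec = np.dot(x, y)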
EDIT: Let me clarify: parallelizing inherently slow code is mostly pointless. First, parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be measured against the fastest possible single-threaded version of the same code (within reason; no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is simply to slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have made no improvement, because the fastest single-threaded version of your code will still outperform your parallel code.
Also, Python really isn't a great language in which to learn how to write parallel code. Python has the GIL, which essentially forces all multithreaded code in Python to run as if there were only one CPU core. This means bizarre hacks (such as the one you linked) must be used, which have their own additional drawbacks and issues (there are times when such tricks are needed/used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing parallel Python code to carry over to other languages or to help you with your course.
I think you will have some good success, depending on where your bottleneck is. One caveat: when I do code optimization I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This will help identify where the time is being spent, i.e. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm, a very nice introduction is Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: computing the distance between data points and models, and updating the model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
for current_model in N:
compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least squares problems, where the size of each depends on the number of parameters in the model being solved for.
Your bottleneck may be in computing the residuals or distances between the data points and the models in stage 1, the E-step. In this stage the computations are all independent. I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map-reduce or some of the other tools in Python. I have had good success using IPython for such tasks, but there are other good Python packages as well.
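To make the "embarrassingly parallel" point concrete, here is a minimal sketch using multiprocessing.Pool (one option among the tools mentioned above; the data, models, and distance function are all placeholders, not taken from the linked code):

    import numpy as np
    from multiprocessing import Pool

    # fixed seed so these module-level globals are identical in spawned worker processes
    rng = np.random.default_rng(0)
    N_MODELS, N_PARAMS, M_POINTS = 5, 3, 10_000
    models = rng.random((N_MODELS, N_PARAMS))   # placeholder model parameters
    data = rng.random((M_POINTS, N_PARAMS))     # placeholder data points

    def residuals_for_point(point):
        # distance from one data point to every model; independent of every other point
        return np.linalg.norm(models - point, axis=1)

    if __name__ == "__main__":
        with Pool() as pool:                    # one worker process per core by default
            rows = pool.map(residuals_for_point, data)
        residuals = np.vstack(rows)             # M x N matrix consumed by the M-step
        print(residuals.shape)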