I am running a couple experiments with the Z3 solver (z3py API), where I measure the quality of the results depending on the timeout, that I set. I am running the different experiments from the same vitualenv but from different classes. After each Experiment i am creating a new solver object like this:
self.solver = z3.Solver()
I have the feeling that results are found faster in the second and so on runs. So I was wondering, whether the z3py API somehow saves some of the preliminary results from previous runs in order to speed up the next one. If so, is there a way to completely reset the solver after a run.
This is extremely unlikely, especially given you're creating the solver anew from scratch. But it is impossible to opine on this as you haven't really shown any code to see if there might be gotchas.
I'd hazard a guess that if you always observe the first solution to be slower than the following ones, make sure you properly account for your Python interpreter to start up, load your program, load all the z3 infrastructure it needs and finally call the solver. Notice that none of that is going to be cheap, especially if the problems you are benchmarking are rather small.
A good way to go would be to toss away the timing results from the first couple of runs, to make sure all the cache-lines in the memory are warmed up and everything is paged in. Then do a comparison of the runs 3 to 15.. Do you still see a difference? That would suggest the presence of other factors, though I doubt it.
But again, it all depends on how you coded this up and what sort of problems you are benchmarking. The random seed chosen by the solver can play a role, but the impact of that should be randomly distributed, if any.
Related
I have a Python script containing a loop with a lot of calls to scipy.optimize.fsolve (99 (55 + 54) times per time step, and right now I need around 10^5 time steps). The rest of the script isn't very fast either, but as far as I can tell from the output of the Spyder Profiler, the calls to fsolve are by far the most time consuming. With my current settings, running the script takes over 10 hours, so I could do with a little speed-up.
Based on what I read here and elsewhere on internet, I already gave PyPy a first try: installed it in a separate environment under conda (MacOS 10.15.5, pypy 7.3.1 and pypy3.6 7.3.1), along with its own versions of numpy, scipy, and pandas, but so far it's actually a bit slower than just Python (195 s vs 171 s, for 100 time steps).
From what I read here (PyPy Status Blog, October '17), this might have to do with using numpy instead of numpypy, and/or repeated allocation of temporary arrays. Apart from calling fsolve 10 million+ times, I'm also using quite a lot of numpy arrays, so that makes sense as far as I can see.
The thing is, I'm not a developer, and I'm completely new to PyPy, so terms like JIT traces don't mean much to me, and deciphering what's in them is likely going to be challenging for me. Moreover, what used to be true in October 2017 may no longer be relevant now. Also, even if I'd manage to speed up the numpy array bit, I'm still not sure about the fsolve part.
Can anyone indicate whether it could be worthwhile for me to invest time in PyPy? Or would Cython be more suitable in this case? Or even mpi4py?
I'd happily share my code if that helps, but it includes a module of over 800 lines of code, so just including it in this post didn't seem like a great idea to me.
Many thanks!
Sita
EDIT: Thanks everyone for your quick and kind response! That's a fair point, about needing to see my code, I put it here (link valid until 19 June 2020). Arterial_1D.py is the module, CoronaryTree.py is the script that calls Arterial_1D.py. For a minimal working example, I put in one extra line, to be uncommented in that case (clearly marked in the code). Also, I set the number of time steps to 100, to have the code run in reasonable time (0.61 s for the minimal example, 37.3 s for the full coronary tree, in my case).
EDIT 2: Silly of me, in my original post I mentioned times of 197 and 171 s for running 100 steps of my code using PyPy and Python, respectively, but in that case I called Python from within the PyPy environment, so it was using the PyPy version of Numpy. From within my base environment running 100 steps takes a little over 30 s. So PyPy is A LOT slower than Python in this case, which motivates me to look into this PyPy Status Blog post anyway.
We can't really help you to optimize without looking at your code. But since you have quite the description going on up there, let me reply with what I think you can try to speed things up.
First thing's first. The Scipy library.
From the source for scipy.optimize.fsolve, it wraps around MINPACK's hybrd and hybrj algorithms which are considerably fast FORTRAN subroutines. So in your case, switching to PyPy is not going to do much good, if any at all for this identified bottleneck.
What can you do to optimize your Scipy fsolve ? One of the most obvious numerical thing to do is to vectorize your function's args. But seems that you are running a sort of time step algorithm and Most standard time stepping algos are not able to vectorize in time. IF your 'XX times per time step' is a sort of implicit spatial loop per time step (i.e. your grid), you can consider vectorizing this to achieve some gains in speed. Next is to zoom into your function's guess / starting root estimate. See if you can mod your algorithm to capitalize on a good starting solution over the whole time interval (do some literature digging). Note that this has less to do with the 'programming' than your knowledge on numerical methods.
Next, on your comment on "rest of the script isn't very fast either". Well, I'd go with Cython to sweat the remaining python parts of your code, esp the loops. It is very actively developed, great community and is battle tested. I personally have used it in many HPC type problems. Cython also has a convenient html annotation that highlights potential optimizations that may be possible over your native python implementation.
Hope this helps! Cheers
I am currently working on an ML NLP project and I want to measure the execution time of certain parts and also potentially predict how long the execution will take. For example, I want to measure the ML training process (including sub-processes like the data preprocessing part). I have been looking online and I have come across different python modules that can measure the execution time of functions (like the time or timeit ones). However, I still haven't found a concrete solution to predict the time it will take for a function to execute. I have thought about running the code several times, save the (data_size, time) values and then use that to extrapolate for future data. I also thought about then updating this estimation with the time it took the run several subparts of a function (like seeing how much of the process was computed, how long it took and then use that to adjust the time left).
However, I am not sure of any of this and I wanted to see if there were better options out there that I wasn't aware of, so if anyone has a better idea, I'd be thankful if you could share it.
Have you looked into using profiling? It should give a detailed breakdown of the function execution times, the number of calls, etc. You will have to execute the script with profiling, and then you will get the detailed breakdown.
https://docs.python.org/3/library/profile.html#module-cProfile
If you want in-time progress reports there are a couple of libraries I've seen. https://pypi.org/project/tqdm/
https://pypi.org/project/progressbar2/
Hope these help!
I am working on a machine learning project in python and many times I found myself rerun some algorithm with different tweaks each time(changing few parameters, different normalization, some extra feature engineering, etc). Each time most computation is similar except a few steps. I can, of course, save some immediate states on disk and load it next time instead of the computing the same thing over and over again.
The thing is that there are so many such immediate results that manually save them and keep a record of them would be a pain. I looked at some python decorator here that can make things a bit easier. However, the problem with this implementation is, that it will always return the same result from the first time you called the function, even when your function has arguments and therefore should produce different results for different arguments. I really need to memorize the output of a function with different arguments.
I googled extensively on this topic and the closest thing that I found that is IncPy by Philip Guo. IncPy (Incremental Python) is an enhanced Python interpreter that speeds up script execution times by automatically memoizing (caching) the results of long-running function calls and then re-using those results rather than re-computing, when safe to do so.
I really like the idea and think it will be very useful for data science and machine learning but the code is written nine years ago for python 2.6 and is no longer maintained.
So my question is that is there any other alternative automatical caching/memorizing techniques in python that can handle relatively large dataset?
Is there a particular software resource monitor that researchers or academics use to compare execution time and other resource usage metrics between programming environments? For instance, if I have a routine in C++, python and another in Matlab, that are all identical in function and similar implantations -how would I make an objective, measurable result comparison as to which was the most efficient process. Likewise is it a tool that could also analyze performance between versions of the same code to track improvements in processing efficiency. Please try to answer this question without generalizations like "oh, C++ is always more efficient than python and python will always be more efficient than Matlab."
The correct way is to write tests. Get current time before actual algo starts, and get current time after it ends. There are ways to do that in c++, python and matlab
You must not think of results as they are 100% precision because of system scheduling process etc, though it is a good way to compare before-after results.
Good way to get more precision results is to run your code multiple times.
I did a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model1 for doing machine translation ( here is my GitHub if you want to look at the code) and it works, but reeeaaaaallly sloowwwlly. I'm taking a class now in parallel computing and I was wondering if I could use Python Multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you do code (such as line 82) where you perform a numerical computation one element at a time, you have that one computation - and all the overhead of the Python interpreter.
The first thing you will want to do is vectorize you code with numpy. Unlike your normal python code, numpy is calling out to precompiled efficient binary code. The more work you can hide into numpy, the less time you will waist in the interpreter.
Once you vectorize your code, you can then start profiling it if its still too slow. You should be able to find a lot of simple examples on how to vectorize python, and some of the alternative options.
EDIT: Let me clarify, that parallelizing inherently slow code is mostly pointless. First, is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be done against the fastest possible single threaded version of the same code (within reason, no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for the lock. This makes it appear as if there is no overhead from lock contention, when in actuality - you have no improvements because the fastest single threaded version of your code will outperform your parallel code.
Also, python really isn't a great language to learn how to write parallel code in. Python has the GIL , which essentially forces all multithreaded code in python to run as if there was but one CPU core. This means bizarre hacks (such as the one you linked) must be done, which have their own additional drawbacks and issues (there are times where such tricks are needed / used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing any parallel python code to carry over to other languages or help you with your course.
I think you will have some good success depending on where your bottleneck is. One caveat - When I do code optimization I always like to profile the code, even informally to get an idea of where the bottlenecks are. This will help identify where the time is being spent i.e. file io, network latency, resource contention, not enough cpu cycles etc...
For others who may not be familiar with the Expectation Maximization algorithm a very nice introduction is in Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: Computing the distance between data points and models and Updating our model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
for current_model in N:
compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least square problems where the size is
dependent on the number of parameters in the model that will be solved for.
Your bottleneck may be in the stage of computing the residuals or distances between the data points and the models stage 1 - E Step. In this stage the computations are all independent. I would consider the first stage as embarassingly parallel and quite amenable to parallel computation using parallel map reduce or some other tools in python. I have good success using IPython for such tasks, but there are other good python packages as well.