I would like to speed up a SimPy simulation (if possible), but I'm not sure the best way to insert timers to even see what is taking long.
Is there a way to do this?
I would recommend using runsnakerun (or I guess snakeviz in py3x), which uses cProfile(there are directions on runsnakerun's webpage)
basically you just run your program
python -m cProfile -o profile.dump my_main.py
then you can get a nice visual view of your profile with runsnake (or snakeviz if using py3)
python runsnakerun.py profile.dump
(note that running it in profile mode will probably slow down your code even more ... but its really just to identify slow parts)
import time
t1 = time.time()
#code to time
t2 = time.time()
print(t2 - t1)
You can use this and compare the times with all code samples you want to test
Related
I'd like to know how'd you measure the amount of clock cycles per instruction say copy int from one place to another?
I know you can time it down to nano seconds but with today's cpu's that resolution is too low to get a correct reading for the oprations that take just a few clock cycles?
It there a way to confirm how many clock cycles per instructions like adding and subing it takes in python? if so how?
This is a very interesting question that can easily throw you into the rabbit's hole. Basically any CPU cycle measurements depends on your processors and compilers RDTSC implementation.
For python there is a package called hwcounter that can be used as follows:
# pip install hwcounter
from hwcounter import Timer, count, count_end
from time import sleep
# Method-1
start = count()
# Do something here:
sleep(1)
elapsed = count_end() - start
print(f'Elapsed cycles: {elapsed:,}')
# Method-2
with Timer() as t:
# Do something here:
sleep(1)
print(f'Elapsed cycles: {t.cycles:,}')
NOTE:
It seem that the hwcounter implementation is currently broken for Windows python builds. A working alternative is to build the pip package using the mingw compiler, instead of MS VS.
Caveats
Using this method, always depend on how your computer is scheduling tasks and threads among its processors. Ideally you'd need to:
bind the test code to one unused processor (aka. processor affinity)
Run the tests over 1k - 1M times to get a good average.
Need a good understanding of not only compilers, but also how python optimize its code internally. Many things are not at all obvious, especially if you come from C/C++/C# background.
Rabbit Hole:
http://en.wikipedia.org/wiki/Time_Stamp_Counter
https://github.com/MicrosoftDocs/cpp-docs/blob/main/docs/intrinsics/rdtsc.md
How to get the CPU cycle count in x86_64 from C++?
__asm
__rdtsc
__cpuid, __cpuidex
Defining __asm Blocks as C Macros
I have a python code that runs a 2D diffusion simulation for a set of parameters. I need to run the code many times, O(1000), like a Monte Carlo approach, using different parameter settings each time. In order to do this more quickly I want to use all the cores on my machine (or cluster), so that each core runs one instance of the code.
In the past I have done this successfully for serial fortran codes by writing a python wrapper that then used multiprocessing map (or starmap in the case of multiple arguments) to call the fortan code in an ensemble of simulations. It works very nicely in that you loop over the 1000 simulations, and the python wrapper farms out a new integration to a core as soon as it becomes free after completing a previous integration.
However, now when I set this up to do the same to run multiple instances of my python (instead of fortran) code, I find it is incredibly slow, much slower than simply running the code 1000 times in serial on a single core. Using the system monitor I see that one core is working at a time, and it never goes above 10-20% load, while of course I expected to see N cores running near 100% (as is the case when I farm out fortran jobs).
I thought it might be a write issue, and so I checked the code carefully to ensure that all plotting is switched off, and in fact there is no file/disk access at all, I now merely have one print statement at the end to print out a final diagnostic.
The structure of my code is like this
I have the main python code in toy_diffusion_2d.py which has a single arg of a dictionary with the run parameters in it:
def main(arg)
loop over timesteps:
calculation simulation over a large-grid
print the result statistic
And then I wrote a "wrapper" script, where I import the main simulation code and try to run it in parallel:
from multiprocessing import Pool,cpu_count
import toy_diffusion_2d
# dummy list of arguments
par1=[1,2,3]
par2=[4,5,6]
# make a list of dictionaries to loop over, 3x3=9 simulations in all.
arglist=[{"par1":p1,"par2":p2} for p1 in par1 for p2 in par2]
ncore=min(len(arglist),int(cpu_count()))
with Pool(processes=ncore) as p:
p.map(toy_diffusion_2d.main,arglist)
The above is a shorter paraphrased example, my actual codes are longer, so I have placed them here:
Main code: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_2d.py
You can run this with the default values like this:
python3 toy_diffusion_2d.py
Wrapper script: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_loop.py
You can run a 4 member ensemble like this:
python3 toy_diffusion_loop.py --diffK=[33000,37500] --tau_sub=[20,30]
(note that the final stat is slightly different each run, even with the same values as the model is stochastic, a version of the stochastic allen-cahn equations in case any one is interested, but uses a stupid explicit solver on the diffusion term).
As I said, the second parallel code works, but as I say it is reeeeeallly slow... like it is constantly gating.
I also tried using starmap, but that was not any different, it is almost like the desktop only allows one python interpreter to run at a time...? I spent hours on it, I'm almost at the point to rewrite the code in Fortran. I'm sure I'm just doing something really stupid to prevent parallel execution.
EDIT(1): this problem is occurring on
4.15.0-112-generic x86_64 GNU/Linux, with Python 3.6.9
In response to the comments, in fact I also find it runs fine on my MAC laptop...
EDIT(2): so it seems my question was a bit of a duplicate of several other postings, apologies! As well as the useful links provided by Pavel, I also found this page very helpful: Importing scipy breaks multiprocessing support in Python I'll edit in the solution below to the accepted answer.
The code sample you provide works just fine on my MacOS Catalina 10.15.6. I can guess you're using some Linux distributive, where, according to this answer, it can be the case that numpy import meddles with core affinity due to being linked with OpenBLAS library.
If your Unix supports scheduler interface, something like this will work:
>>> import os
>>> os.sched_setaffinity(0, set(range(cpu_count)))
Another question that has a good explanation of this problem is found here and the solution suggested is this:
os.system('taskset -cp 0-%d %s' % (ncore, os.getpid()))
inserted right before the multiprocessing call.
I am starting to use cProfile to profile my python script.
I have noticed something very weird.
When I use time to measure the running time of my script it takes 4.3 seconds.
When I use python -m cProfile script.py it takes 7.3 seconds.
When running the profiler within the code:
import profile
profile.run('main()')
it takes 63 seconds!!
I can understand why it might take a little more time when adding profiling, but why there is such a difference between using cProfile from outside or as part of the code?
Is there a reason why it takes so much time when I use profile.run?
Oddly enough, what you're seeing is expected behavior. In the introduction to the profilers section of the Python docs, it states that profile adds "significant overhead to profiled programs" as compared to cProfile. The difference you're seeing is in the libraries you're using, not how you're calling them. Consider this script:
import profile
import cProfile
def nothing():
return
def main():
for i in xrange(1000):
for j in xrange(1000):
nothing()
return
cProfile.run('main()')
profile.run('main()')
The output from cProfile shows main takes about 0.143 seconds to run, while the profile variant reports 1.645 seconds, which is ~11.5 times longer.
Now let's change the script again to this:
def nothing():
return
def main():
for i in xrange(1000):
for j in xrange(1000):
nothing()
return
if __name__ == "__main__":
main()
And call it with the profilers:
python -m profile test_script.py
Reports 1.662 seconds for main to run.
python -m cProfile test_script.py
Reports 0.143 seconds for main to run.
This shows that the way you launch the profilers has nothing to do with the discrepancy you've seen between cProfile and profile. The difference is caused by how the two profilers handle "events" such as function calls or returns. In both cases, there are software hooks all over your executing code that trigger callbacks to track these events and do things like update counters for the events and start or stop timers. However, the profile module handles all of these events natively in Python, which means your interpreter has to leave your code, execute the callback stuff, then return to continue with your code.
The same things have to happen with cProfile (execute the profiling callbacks), but it's much faster because the callbacks are written in C. A look at the two module files profile.py and cProfile.py demonstrates some differences:
profile.py is 610 lines while cProfile.py is only 199 - most of its functions are handled in C.
profile.py primarily utilizes Python libraries while cProfile.py imports "_lsprof", a C code file. Source can be viewed here.
The Profile class in profile.py doesn't inherit from any other class (line 111), while the Profile class in cProfile.py (line 66) inherits from _lsprof.Profiler, which is implemented in the C source file.
As the docs state, cProfile is generally the way to go, simply because it is mostly implemented in C, so everything is faster.
As an aside, you can improve the performance of profile by calibrating it. Details on how to do that are available in the docs There are more details about how/why all this stuff is the way it is in the Python docs sections on Deterministic Profiling and limitations.
TL;DR
cProfile is much faster because, as its name implies, most of it is implemented in C. This is in contrast to the profile module, which has to handle all of the profiling callbacks in native Python. Whether you invoke the profilers from the command line or manually inside your script has no effect on the time difference between the two modules.
Using python for starting modelling/simulation runs (tuflow) and logging the runs to db
Currently on windows , python 2.7 and using timeit() .
Is it better to stick with using timeit() or to switch to using time.clock() ?
Simulation/modelling runs can be anything from a couple of minutes to a week+.
Need reasonably accurate runtimes.
I know time.clock has been depreciated on 3.3+ but I can't see this code getting moved to 3 for a long while.
self.starttime = timeit.default_timer()
run sim()
self.endtime = timeit.default_timer()
self.runtime = self.starttime - self.endtime
timeit's `timeit() function performs the code multiple times and takes the best, so it's not a great choice for lengthy runs. But your code just uses the timer, whose documentation warns that it measures elapsed time, so you will need to ensure standard running conditions in so far as you can. If you want to measure actual CPU usage, that's tricker.
When you say "reasonably accurate" times, one percent of two minutes is 2.4 seconds. The precision of time.clock() or the default timer is never going to be less than a second. Is that accurate enough?
To mitigate any possibility of migration to Python 3 you can define your own timing function. On Python 2 it can use time.clock() but at least you will only have one place to alter code if you do migrate.
I've been working on some Project Euler problems in Python 3 [osx 10.9], and I like to know how long they take to run.
I've been using the following two approaches to time my programs:
1)
import time
start = time.time()
[program]
print(time.time() - start)
2) On the bash command line, typing time python3 ./program.py
However, these two methods often give wildy different results. In the program I am working on now, the first returns 0.000263 (seconds, truncated) while the second gives
real 0m0.044s
user 0m0.032s
sys 0m0.009s
Clearly there is a huge discrepancy - two orders of magnitude compared to the real time.
My questions are:
a) Why the difference? Is it overhead from the interpreter?
b) Which one should I be using to accurately determine how long the program takes to run? Is time.time() accurate at such small intervals?
I realize these miniscule times are not of the utmost importance; this was more of a curiosity.
Thanks.
[UPDATE:]
Thank-you to all of the answers & comments. You were correct with the overhead. This program:
import time
start = time.time()
print("hello world")
print(time.time() - start)
takes ~0.045 sec, according to bash.
My complicated Project Euler problem took ~0.045 sec, according to bash. Problem solved.
I'll take a look at timeit. Thanks.
The interpreter imports site.py and can touch upon various other files on start-up. This all takes time before your import time line is ever executed:
$ touch empty.py
$ time python3 empty.py
real 0m0.158s
user 0m0.033s
sys 0m0.021s
When timing code, take into account that other processes, disk flushes and hardware interrupts all take time too and influence your timings.
Use timeit.default_timer() to get the most accurate timer for your platform, but preferably use the timeit module itself to time individual snippets of code to eliminate as many variables as possible.
Because when you run the time builtin in bash the real time taken includes the time taken to start up the Python interpreter and import the required modules to run your code, rather than just timing the execution of a single function in your code.
To see this, try for example
import os
import time
start = time.time()
os.system('python <path_to_your_script>')
print time.time() - start
You'll find that this is much closer to what time reports.