I'm trying to profile my Python script using cProfile and display the results with pstats. In particular, I'm trying to use the pstats call p.sort_stats('time').print_callers(20) to print only the top 20 functions by time, as described in the documentation.
I expect to get only the top 20 results (profiled functions and their calling functions, ordered by time). Instead, I get a seemingly unfiltered list that completely saturates my terminal, which is why I estimate it at over 1,000 functions.
Why is my restriction argument (i.e. 20) being ignored by print_callers() and how can I fix this?
I've tried looking up an answer and couldn't find one. I also tried to create a minimal reproducible example, but when I do, I can't reproduce the problem (i.e. it works fine).
My profiling code is:
import cProfile
import pstats

if __name__ == '__main__':
    cProfile.run('main()', 'mystats')
    p = pstats.Stats('mystats')
    p.sort_stats('time').print_callers(20)
I'm trying to avoid having to post my full code, so if someone else has encountered this issue before, and can answer without seeing my full code, that would be great.
Thank you very much in advance.
Edit 1:
Partial output:
Ordered by: internal time
List reduced from 1430 to 1 due to restriction <1>
Function was called by...
ncalls tottime cumtime
{built-in method builtins.isinstance} <- 2237 0.000 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
9 0.000 0.000 <frozen importlib._bootstrap_external>:485(_compile_bytecode)
44 0.000 0.000 <frozen importlib._bootstrap_external>:1117(_get_spec)
4872 0.001 0.001 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\_strptime.py:321(_strptime)
5 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\abc.py:196(__subclasscheck__)
26 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\calendar.py:58(__getitem__)
14 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\calendar.py:77(__getitem__)
2 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\distutils\version.py:331(_cmp)
20 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\enum.py:797(__or__)
362 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\enum.py:803(__and__)
1 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\inspect.py:73(isclass)
30 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\json\encoder.py:182(encode)
2 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\ntpath.py:34(_get_bothseps)
1 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\ntpath.py:75(join)
4 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\ntpath.py:122(splitdrive)
3 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\ntpath.py:309(expanduser)
4 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\os.py:728(check_str)
44 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\re.py:249(escape)
4 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\re.py:286(_compile)
609 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\dateutil\parser\_parser.py:62(__init__)
1222 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\_methods.py:48(_count_reduce_items)
1222 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\_methods.py:58(_mean)
1 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\arrayprint.py:834(__init__)
1393 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:1583(ravel)
1239 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:1966(sum)
...
I figured out the issue.
As usual, the Python library does not have a bug; rather, I misunderstood the output of the function call.
I'm elaborating it here as an answer in case it helps anyone clear up this misunderstanding in the future.
When I asked the question, I didn't understand why p.print_callers(20) prints over a thousand lines to the terminal even though I am restricting it to the top 20 functions (by time).
What is actually happening is that the restriction to the top 20 "most time-consuming functions" does limit the list to those 20 functions, but for each of them print_callers() then lists every function that called it.
Since each of the top 20 functions was called by roughly 100 different functions on average, each of them had about 100 caller lines associated with it. So 20 × 100 = 2000, which is why p.print_callers(20) printed well over a thousand lines and saturated my terminal.
I hope this saves someone some time and debugging headache :)
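For anyone who just wants a shorter report, a couple of workarounds (a sketch based on the standard pstats API; 'mystats' is the stats file from above): print_stats(20) gives one line per function with no caller breakdown, and the stream argument lets you send the full caller listing to a file instead of the terminal.

import pstats

p = pstats.Stats('mystats')
# one line per function, top 20 by internal time
p.sort_stats('time').print_stats(20)

# full caller breakdown, but written to a file instead of the terminal
with open('callers.txt', 'w') as f:
    pstats.Stats('mystats', stream=f).sort_stats('time').print_callers(20)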
Related
I have some Python code that is generating a large data set via numerical simulation. The code is using Numpy for a lot of the calculations and Pandas for a lot of the top-level data. The data sets are large so the code is running slowly, and now I'm trying to see if I can use cProfile to find and fix some hot spots.
The trouble is that cProfile is identifying a lot of the hot spots as pieces of code within Pandas, within Numpy, and/or Python builtins. Here are the cProfile statistics sorted by 'tottime' (total time within the function itself). Note that I'm obscuring the project name and file names, since the code itself is not owned by me and I don't have permission to share details.
foo.sort_stats('tottime').print_stats(50)
Wed Jun 5 13:18:28 2019 c:\localwork\xxxxxx\profile_data
297514385 function calls (291105230 primitive calls) in 306.898 seconds
Ordered by: internal time
List reduced from 4141 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
281307 31.918 0.000 34.731 0.000 {pandas._libs.lib.infer_dtype}
800 31.443 0.039 31.476 0.039 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\numpy\lib\function_base.py:4703(delete)
109668 23.837 0.000 23.837 0.000 {method 'clear' of 'dict' objects}
153481 19.369 0.000 19.369 0.000 {method 'ravel' of 'numpy.ndarray' objects}
5861614 14.182 0.000 78.492 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexes\base.py:3090(get_value)
5861614 8.891 0.000 8.891 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
5861614 8.376 0.000 99.084 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\series.py:764(__getitem__)
26840695 7.032 0.000 11.009 0.000 {built-in method builtins.isinstance}
26489324 6.547 0.000 14.410 0.000 {built-in method builtins.getattr}
11846279 6.177 0.000 19.809 0.000 {pandas._libs.lib.values_from_object}
[...]
Is there a sensible way for me to figure out which parts of my code are excessively leaning on these library functions and built-ins? I anticipate one answer would be "look at the cumulative time statistics; they will probably indicate where these costly calls are originating." The cumulative times give a little bit of insight:
foo.sort_stats('cumulative').print_stats(50)
Wed Jun 5 13:18:28 2019 c:\localwork\xxxxxx\profile_data
297514385 function calls (291105230 primitive calls) in 306.898 seconds
Ordered by: cumulative time
List reduced from 4141 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
643/1 0.007 0.000 307.043 307.043 {built-in method builtins.exec}
1 0.000 0.000 307.043 307.043 xxxxxx.py:1(<module>)
1 0.002 0.002 306.014 306.014 xxxxxx.py:264(write_xxx_data)
1 0.187 0.187 305.991 305.991 xxxxxx.py:256(write_yyyy_data)
1 0.077 0.077 305.797 305.797 xxxxxx.py:250(make_zzzzzzz)
1 0.108 0.108 187.845 187.845 xxxxxx.py:224(generate_xyzxyz)
108223 1.977 0.000 142.816 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexing.py:298(_setitem_with_indexer)
1 0.799 0.799 126.733 126.733 xxxxxx.py:63(populate_abcabc_data)
1 0.030 0.030 117.874 117.874 xxxxxx.py:253(<listcomp>)
7201 0.077 0.000 116.612 0.016 C:\LocalWork\xxxxxx\yyyyyyyyyyyy.py:234(xxx_yyyyyy)
108021 0.497 0.000 112.908 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexing.py:182(__setitem__)
5861614 8.376 0.000 99.084 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\series.py:764(__getitem__)
110024 0.917 0.000 81.210 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:3500(apply)
108021 0.185 0.000 80.685 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:3692(setitem)
5861614 14.182 0.000 78.492 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexes\base.py:3090(get_value)
108021 1.887 0.000 73.064 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:819(setitem)
[...]
Is there a good way to pin down the hot spots -- better than "crawl through xxxxxx.py and search for all places where Pandas might be inferring a datatype, and where Numpy might be deleting objects"...?
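One avenue I can think of is pstats' print_callers(), which accepts the same restriction arguments as print_stats() and lists, for each matching function, the callers and the time those calls account for. A sketch (the patterns below are only meant to match the expensive entries in the listing above):

# who ends up calling pandas' infer_dtype?
foo.sort_stats('cumulative').print_callers('infer_dtype')

# who ends up calling numpy's delete()?
foo.sort_stats('cumulative').print_callers(r'function_base\.py.*delete')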
I am trying to parallelize an embarrassingly parallel for loop (previously asked here) and settled on this implementation that fit my parameters:
with Manager() as proxy_manager:
    shared_inputs = proxy_manager.list([datasets, train_size_common, feat_sel_size, train_perc,
                                        total_test_samples, num_classes, num_features, label_set,
                                        method_names, pos_class_index, out_results_dir, exhaustive_search])
    partial_func_holdout = partial(holdout_trial_compare_datasets, *shared_inputs)

    with Pool(processes=num_procs) as pool:
        cv_results = pool.map(partial_func_holdout, range(num_repetitions))
The reason I need a proxy object (shared between processes) is that the first element of the shared proxy list, datasets, is itself a list of large objects (each about 200-300 MB). This datasets list usually has 5-25 elements. I typically need to run this program on an HPC cluster.
Here is the question: when I run this program with 32 processes and 50 GB of memory (num_repetitions=200, with datasets being a list of 10 objects, each 250 MB), I do not see a speedup of even a factor of 16 (with 32 parallel processes). I do not understand why. Any clues? Any obvious mistakes or bad choices? Where can I improve this implementation? Any alternatives?
I am sure this has been discussed before, and the reasons can be varied and very specific to the implementation, so please share your two cents. Thanks.
Update: I did some profiling with cProfile to get a better idea - here is some info, sorted by cumulative time.
In [19]: p.sort_stats('cumulative').print_stats(50)
Mon Oct 16 16:43:59 2017 profiling_log.txt
555404 function calls (543552 primitive calls) in 662.201 seconds
Ordered by: cumulative time
List reduced from 4510 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
897/1 0.044 0.000 662.202 662.202 {built-in method builtins.exec}
1 0.000 0.000 662.202 662.202 test_rhst.py:2(<module>)
1 0.001 0.001 661.341 661.341 test_rhst.py:70(test_chance_classifier_binary)
1 0.000 0.000 661.336 661.336 /Users/Reddy/dev/neuropredict/neuropredict/rhst.py:677(run)
4 0.000 0.000 661.233 165.308 /Users/Reddy/anaconda/envs/py36/lib/python3.6/threading.py:533(wait)
4 0.000 0.000 661.233 165.308 /Users/Reddy/anaconda/envs/py36/lib/python3.6/threading.py:263(wait)
23 661.233 28.749 661.233 28.749 {method 'acquire' of '_thread.lock' objects}
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:261(map)
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:637(get)
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:634(wait)
866/8 0.004 0.000 0.868 0.108 <frozen importlib._bootstrap>:958(_find_and_load)
866/8 0.003 0.000 0.867 0.108 <frozen importlib._bootstrap>:931(_find_and_load_unlocked)
720/8 0.003 0.000 0.865 0.108 <frozen importlib._bootstrap>:641(_load_unlocked)
596/8 0.002 0.000 0.865 0.108 <frozen importlib._bootstrap_external>:672(exec_module)
1017/8 0.001 0.000 0.863 0.108 <frozen importlib._bootstrap>:197(_call_with_frames_removed)
522/51 0.001 0.000 0.765 0.015 {built-in method builtins.__import__}
The profiling info, now sorted by internal time:
In [20]: p.sort_stats('time').print_stats(20)
Mon Oct 16 16:43:59 2017 profiling_log.txt
555404 function calls (543552 primitive calls) in 662.201 seconds
Ordered by: internal time
List reduced from 4510 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
23 661.233 28.749 661.233 28.749 {method 'acquire' of '_thread.lock' objects}
115/80 0.177 0.002 0.211 0.003 {built-in method _imp.create_dynamic}
595 0.072 0.000 0.072 0.000 {built-in method marshal.loads}
1 0.045 0.045 0.045 0.045 {method 'acquire' of '_multiprocessing.SemLock' objects}
897/1 0.044 0.000 662.202 662.202 {built-in method builtins.exec}
3 0.042 0.014 0.042 0.014 {method 'read' of '_io.BufferedReader' objects}
2037/1974 0.037 0.000 0.082 0.000 {built-in method builtins.__build_class__}
286 0.022 0.000 0.061 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/misc/doccer.py:12(docformat)
2886 0.021 0.000 0.021 0.000 {built-in method posix.stat}
79 0.016 0.000 0.016 0.000 {built-in method posix.read}
597 0.013 0.000 0.021 0.000 <frozen importlib._bootstrap_external>:830(get_data)
276 0.011 0.000 0.013 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/sre_compile.py:250(_optimize_charset)
108 0.011 0.000 0.038 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:626(_construct_argparser)
1225 0.011 0.000 0.050 0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
7179 0.009 0.000 0.009 0.000 {method 'splitlines' of 'str' objects}
33 0.008 0.000 0.008 0.000 {built-in method posix.waitpid}
283 0.008 0.000 0.015 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/misc/doccer.py:128(indentcount_lines)
3 0.008 0.003 0.008 0.003 {method 'poll' of 'select.poll' objects}
7178 0.008 0.000 0.008 0.000 {method 'expandtabs' of 'str' objects}
597 0.007 0.000 0.007 0.000 {method 'read' of '_io.FileIO' objects}
More profiling info sorted by percall info:
Update 2
The elements in the large list datasets I mentioned earlier are not usually that big; they are typically 10-25 MB each. But depending on the floating-point precision used and the number of samples and features, this can easily grow to 500 MB-1 GB per element. Hence I'd prefer a solution that can scale.
Update 3:
The code inside holdout_trial_compare_datasets uses scikit-learn's GridSearchCV, which internally uses the joblib library if we set n_jobs > 1 (or whenever we set it at all). This might lead to bad interactions between multiprocessing and joblib. So I'm trying another config where I do not set n_jobs at all (which should default to no parallelism within scikit-learn). I will keep you posted.
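For reference, the kind of configuration I mean (the estimator and parameter grid here are placeholders, not my real code): keeping n_jobs at 1 inside each worker leaves all the parallelism to the outer Pool.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# n_jobs left at 1 so GridSearchCV does not spawn joblib workers
# inside each multiprocessing worker process
grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, n_jobs=1)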
Based on the discussion in the comments, I did a mini experiment comparing three versions of the implementation:
v1: essentially the same as your approach. In fact, since partial(f1, *shared_inputs) unpacks the proxy_manager.list immediately, Manager.List is not actually involved here; the data is passed to the workers through Pool's internal queue.
v2: uses Manager.List; the work function receives a ListProxy object and fetches the shared data via an internal connection to the manager's server process.
v3: the child processes share the data from the parent, taking advantage of the fork(2) system call.
import multiprocessing as mp
from functools import partial
from time import time

def f1(*args):
    for e in args[0]: pow(e, 2)

def f2(*args):
    for e in args[0][0]: pow(e, 2)

def f3(n):
    for i in datasets: pow(i, 2)

def v1(np):
    with mp.Manager() as proxy_manager:
        shared_inputs = proxy_manager.list([datasets, ])
        pf = partial(f1, *shared_inputs)
        with mp.Pool(processes=np) as pool:
            r = pool.map(pf, range(16))

def v2(np):
    with mp.Manager() as proxy_manager:
        shared_inputs = proxy_manager.list([datasets, ])
        pf = partial(f2, shared_inputs)
        with mp.Pool(processes=np) as pool:
            r = pool.map(pf, range(16))

def v3(np):
    with mp.Pool(processes=np) as pool:
        r = pool.map(f3, range(16))

datasets = [2.0 for _ in range(10 * 1000 * 1000)]

for f in (v1, v2, v3):
    print(f.__code__.co_name)
    for np in (2, 4, 8, 16):
        s = time()
        f(np)
        print("%s %.2fs" % (np, time() - s))
Results were taken on a 16-core E5-2682 VPC; it is obvious that v3 scales better:
{method 'acquire' of '_thread.lock' objects}
Looking at your profiler output I would say that the shared object lock/unlock overhead overwhelms the speed gains of multithreading.
Refactor so that the work is farmed out to workers that do not need to talk to one another as much.
Specifically, if possible, derive one answer per data pile and then act on the accumulated results.
This is why Queues can seem so much faster: they involve a type of work that does not require an object that has to be 'managed' and so locked/unlocked.
Only 'manage' things that absolutely need to be shared between processes. Your managed list contains some very complicated looking objects...
A faster paradigm is:
allwork = manager.list([a, b, c])
theresult = manager.list()
and then
while allwork:
    unitofwork = allwork.pop()
    theresult.append(myfunction(unitofwork))
If you do not need a complex shared object, then only use a list of the most simple objects imaginable.
Then tell the workers to acquire the complex data that they can process in their own little world.
Try:
allwork = manager.list([datasetid1, datasetid2, ...])
theresult = manager.list()

while allwork:
    unitofworkid = allwork.pop()
    theresult.append(myfunction(unitofworkid))

def myfunction(unitofworkid):
    thework = acquiredataset(unitofworkid)
    result = holdout_trial_compare_datasets(thework, ...)
    return result
I hope this makes sense. It should not take too much time to refactor in this direction, and you should see the {method 'acquire' of '_thread.lock' objects} number drop like a rock when you profile.
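To make that concrete, here is a self-contained toy version of the pattern (all names are illustrative, not from the original project): workers receive only small IDs and load their own data, so nothing large crosses process boundaries and no managed objects need to be locked.

from multiprocessing import Pool

def acquire_dataset(dataset_id):
    # stand-in for loading one dataset from disk inside the worker
    return [float(dataset_id)] * 1_000_000

def work(dataset_id):
    data = acquire_dataset(dataset_id)
    return sum(x * x for x in data)   # stand-in for the real trial

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # only small integers are pickled and sent to the workers
        results = pool.map(work, range(10))
    print(results[:3])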
I have a program that is reading some data from an Excel spreadsheet (a small one: ~10 sheets with ~100 cells per sheet), doing some calculations, and then writing output to cells in a spreadsheet.
The program ran quickly until I modified it to write its output into the same Excel file from which the input is read. Previously I was generating a new spreadsheet and then copying the output into the original file manually.
After the modifications the script's runtime jumped from a few seconds to about 7 minutes. I ran cProfile to investigate and got this output, sorted by cumulative runtime:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 440.918 440.918 xlsx_transport_calc.py:1(<module>)
1 0.000 0.000 437.926 437.926 excel.py:76(load_workbook)
1 0.000 0.000 437.924 437.924 excel.py:161(_load_workbook)
9 0.000 0.000 437.911 48.657 worksheet.py:302(read_worksheet)
9 0.000 0.000 437.907 48.656 worksheet.py:296(fast_parse)
9 0.065 0.007 437.906 48.656 worksheet.py:61(parse)
9225 45.736 0.005 437.718 0.047 worksheet.py:150(parse_column_dimensions)
9292454 80.960 0.000 391.640 0.000 functools.py:105(wrapper)
9292437 62.181 0.000 116.213 0.000 cell.py:94(get_column_letter)
18585439 20.881 0.000 98.832 0.000 threading.py:214(__exit__)
18585443 58.912 0.000 86.641 0.000 threading.py:146(acquire)
18585443 56.600 0.000 77.951 0.000 threading.py:186(release)
9293461/9293452 22.317 0.000 22.319 0.000 {method 'join' of 'str' objects}
37170887 15.795 0.000 15.795 0.000 threading.py:63(_note)
21406059 13.460 0.000 13.460 0.000 {divmod}
37170888 12.853 0.000 12.853 0.000 {thread.get_ident}
18585447 12.589 0.000 12.589 0.000 {method 'acquire' of 'thread.lock' objects}
21408493 9.948 0.000 9.948 0.000 {chr}
21441151 8.323 0.000 8.323 0.000 {method 'append' of 'list' objects}
18585446 7.843 0.000 7.843 0.000 {method 'release' of 'thread.lock' objects}
...
...
...
Relevant code in the script:
...
from openpyxl import load_workbook
import pandas as pd
...
xlsx = 'path/to/spreadsheet.xlsx'
...
def loadxlsx(fname, sname, usecols=None):
    with pd.ExcelFile(fname) as ef:
        df = ef.parse(sheetname=sname)
        if usecols:
            return [df.values[:, col] for col in usecols]
        else:
            return [df.values[:, col] for col in range(df.shape[1])]
...
data = loadxlsx('path/to/spreadsheet.xlsx')
...
<do computations>
...
book = load_workbook(xlsx)
<write data back to spreadsheet>
...
So, according to the cProfile output, the culprit appears to be something within the call to load_workbook. Beyond that observation I'm a bit confused. Why are there 9,000 calls to parse_column_dimensions, 18 million calls to various threading functions, and 9 million calls to get_column_letter?
This is the first time I have profiled any python scripts so I'm not sure if this output is normal or not... It seems to have some odd portions though.
Can anyone shed some light on what might be happening here?
I don't know what's happening in the Pandas code, but the number of calls looks wrong. If you try simply opening the file with openpyxl and modifying the cells in place, it should be a lot faster, so it looks like there is some unnecessary looping going on.
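A minimal sketch of that suggestion, assuming a reasonably recent openpyxl (the path, sheet name, and cell are placeholders): open the existing workbook, write the computed values directly into cells, and save, without round-tripping through pandas.

from openpyxl import load_workbook

book = load_workbook('path/to/spreadsheet.xlsx')
sheet = book['Results']            # assumed sheet name
sheet['B2'] = 42                   # write a computed value in place
book.save('path/to/spreadsheet.xlsx')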
I've got an Element class that has some functions, one like this:
def clean(self):
    self.dirty = False
I have 1024 elements, and I'm calling clean on each one of them in a while 1: loop.
If I stop calling the clean method, the game framerate goes up from 76 fps to 250 fps.
This is pretty disturbing. Do I really have to be THIS careful not to completely lag out my code?
Edit (here's the full code):
250 fps code
for layer in self.layers:
    elements = self.layers[layer]
    for element in elements:
        if element.getDirty():
            element.update()
            self.renderImage(element.getImage(), element.getRenderPos())
            element.clean()
76 fps code
for layer in self.layers:
    elements = self.layers[layer]
    for element in elements:
        if element.getDirty():
            element.update()
            self.renderImage(element.getImage(), element.getRenderPos())
        element.clean()
Edit 2 (here are the profiling results):
Sat Feb 9 22:39:58 2013 stats.dat
23060170 function calls (23049668 primitive calls) in 27.845 seconds
Ordered by: internal time
List reduced from 613 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
3720076 5.971 0.000 12.048 0.000 element.py:47(clean)
909 4.869 0.005 17.918 0.020 chipengine.py:30(updateElements)
3742947 4.094 0.000 5.443 0.000 copy.py:67(copy)
4101 3.972 0.001 3.972 0.001 engine.py:152(killScheduledElements)
11773 1.321 0.000 1.321 0.000 {method 'blit' of 'pygame.Surface' objects}
4 1.210 0.302 1.295 0.324 resourceloader.py:14(__init__)
3720076 0.918 0.000 0.918 0.000 element.py:55(getDirty)
1387 0.712 0.001 0.712 0.001 {built-in method flip}
3742947 0.705 0.000 0.705 0.000 copy.py:102(_copy_immutable)
3728284 0.683 0.000 0.683 0.000 {method 'copy' of 'pygame.Rect' objects}
3743140 0.645 0.000 0.645 0.000 {method 'get' of 'dict' objects}
5494 0.566 0.000 0.637 0.000 element.py:89(isBeclouded)
2296 0.291 0.000 0.291 0.000 {built-in method get}
1 0.267 0.267 0.267 0.267 {built-in method init}
1387 0.244 0.000 25.714 0.019 engine.py:67(updateElements)
2295 0.143 0.000 0.143 0.000 {method 'tick' of 'Clock' objects}
11764 0.095 0.000 0.169 0.000 element.py:30(update)
8214/17 0.062 0.000 4.455 0.262 engine.py:121(createElement)
40 0.052 0.001 0.052 0.001 {built-in method load_extended}
36656 0.046 0.000 0.067 0.000 element.py:117(isCollidingWith)
The profiling says that calling the clean method takes about 6 out of 28 seconds during the profiling. It also gets called 3.7 million times during that time.
That means that the loop you are showing must be the main loop of the software. That main loop also does only the following things:
Checks if the element is dirty.
If it is, it draws it.
Then cleans it.
Since most elements are not dirty (update() gets called in only 11 thousand of these 3.7 million iterations), the end result is that your main loop is mostly doing just one thing: checking whether the element is dirty and then calling .clean() on it.
By only calling clean if the element is dirty, you have effectively cut the main loop in half.
Do I really have to be THIS careful not to completely lag out my code?
Yes. If you have a very tight loop that most of the time does nothing, then you have to make sure that the loop really is tight.
This is pretty disturbing.
No, it's a fundamental computing fact.
(This would be a comment, but my reputation is too low to comment.)
If you are calling element.getDirty() 3.7 million times and it's only dirty 11 thousand times, you should be keeping a dirty list instead of polling every time. That is, don't set a dirty flag; add the dirty element to a dirty-element list. It looks like you might need a dirty list for each layer.
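A rough sketch of that idea (class and function names here are illustrative, not from the game's code): elements register themselves in their layer's dirty list when they change, so the main loop never polls getDirty() or calls clean() on elements that are already clean.

class Layer:
    def __init__(self):
        self.elements = []
        self.dirty_elements = []          # only elements that need redrawing

class Element:
    def __init__(self, layer):
        self.layer = layer

    def mark_dirty(self):
        # instead of setting a flag, register with the layer's dirty list
        self.layer.dirty_elements.append(self)

    def update(self):
        pass                              # game-specific work goes here

def render_dirty(layers, render_image):
    # the main loop touches only elements that registered as dirty
    for layer in layers:
        for element in layer.dirty_elements:
            element.update()
            render_image(element)
        layer.dirty_elements.clear()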
I am attempting to profile a long-running Python script. The script does some spatial analysis on a raster GIS data set using the gdal module. The script currently uses three files: the main script that loops over the raster pixels, find_pixel_pairs.py; a simple cache in lrucache.py; and some miscellaneous classes in utils.py. I have profiled the code on a moderately sized dataset. pstats returns:
p.sort_stats('cumulative').print_stats(20)
Thu May 6 19:16:50 2010 phes.profile
355483738 function calls in 11644.421 CPU seconds
Ordered by: cumulative time
List reduced from 86 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.008 0.008 11644.421 11644.421 <string>:1(<module>)
1 11064.926 11064.926 11644.413 11644.413 find_pixel_pairs.py:49(phes)
340135349 544.143 0.000 572.481 0.000 utils.py:173(extent_iterator)
8831020 18.492 0.000 18.492 0.000 {range}
231922 3.414 0.000 8.128 0.000 utils.py:152(get_block_in_bands)
142739 1.303 0.000 4.173 0.000 utils.py:97(search_extent_rect)
745181 1.936 0.000 2.500 0.000 find_pixel_pairs.py:40(is_no_data)
285478 1.801 0.000 2.271 0.000 utils.py:98(intify)
231922 1.198 0.000 2.013 0.000 utils.py:116(block_to_pixel_extent)
695766 1.990 0.000 1.990 0.000 lrucache.py:42(get)
1213166 1.265 0.000 1.265 0.000 {min}
1031737 1.034 0.000 1.034 0.000 {isinstance}
142740 0.563 0.000 0.909 0.000 utils.py:122(find_block_extent)
463844 0.611 0.000 0.611 0.000 utils.py:112(block_to_pixel_coord)
745274 0.565 0.000 0.565 0.000 {method 'append' of 'list' objects}
285478 0.346 0.000 0.346 0.000 {max}
285480 0.346 0.000 0.346 0.000 utils.py:109(pixel_coord_to_block_coord)
324 0.002 0.000 0.188 0.001 utils.py:27(__init__)
324 0.016 0.000 0.186 0.001 gdal.py:848(ReadAsArray)
1 0.000 0.000 0.160 0.160 utils.py:50(__init__)
The top two calls contain the main loop - the entire analysis. The remaining calls sum to less than 625 of the 11,644 seconds. Where are the remaining 11,000 seconds spent? Is it all within the main loop of find_pixel_pairs.py? If so, can I find out which lines of code are taking most of the time?
You are right that most of the time is being spent in the phes function on line 49 of find_pixel_pairs.py. To find out more, you need to break up phes into more subfunctions, and then reprofile.
Forget functions and measuring. Use this technique. Just run it in debug mode, and do ctrl-C a few times. The call stack will show you exactly which lines of code are responsible for the time.
Added: For example, pause it 10 times. If, as EOL says, 10400 seconds out of 11000 are being spent directly in phes, then on about 9 of those pauses, it will stop right there.
If, on the other hand, it is spending a large fraction of time in some subroutine called from phes, then you will not only see where it is in that subroutine, but you will see the lines that call it, which are also responsible for the time, and so on, up the call stack.
Don't measure. Capture.
The time spent executing the code of each function or method is in the tottime column. The cumtime column is tottime plus the time spent in the functions it calls.
In your listing, you see that the 11,000 seconds that you are looking for are spent directly by the phes function itself. What it calls only takes about 600 seconds.
Thus, you want to find what takes time in phes, by breaking it up into subfunctions and reprofiling, as ~unutbu suggested.
If you've identified potential bottlenecks in the phes function/method in find_pixel_pairs.py, you can use line_profiler to get line-by-line profiling numbers like these (copied from another question here):
Timer unit: 1e-06 s
Total time: 9e-06 s
File: <ipython-input-4-dae73707787c>
Function: do_other_stuff at line 4
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 def do_other_stuff(numbers):
5 1 9 9.0 100.0 s = sum(numbers)
Total time: 0.000694 s
File: <ipython-input-4-dae73707787c>
Function: do_stuff at line 7
Line # Hits Time Per Hit % Time Line Contents
==============================================================
7 def do_stuff(numbers):
8 1 12 12.0 1.7 do_other_stuff(numbers)
9 1 208 208.0 30.0 l = [numbers[i]/43 for i in range(len(numbers))]
10 1 474 474.0 68.3 m = ['hello'+str(numbers[i]) for i in range(len(numbers))]
With this information, you do not need to break phes up into multiple subfunctions, because you can see precisely which lines have the highest execution time.
Since you mention that your script is long-running, I'd recommend using line_profiler on as limited a number of functions as possible, because while profiling always adds overhead, line profiling can add even more.
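For completeness, a minimal sketch of the usual line_profiler workflow (the function signature is a placeholder; the @profile decorator is injected by kernprof at run time, so it needs no import):

# find_pixel_pairs.py -- decorate only the function(s) you want line-profiled
@profile
def phes(*args, **kwargs):
    ...

# then run from the shell:
#   kernprof -l -v find_pixel_pairs.py
# which writes find_pixel_pairs.py.lprof and prints a per-line report
# like the one shown above.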