Performance difference between list comprehensions and for loops - Python

I have a script that finds the sum of all numbers that can be written as the sum of fifth powers of their digits. (This problem is described in more detail on the Project Euler web site.)
I have written it two ways, but I do not understand the performance difference.
The first way uses nested list comprehensions:
exp = 5

def min_combo(n):
    return ''.join(sorted(list(str(n))))

def fifth_power(n, exp):
    return sum([int(x) ** exp for x in list(n)])

print sum( [fifth_power(j,exp) for j in set([min_combo(i) for i in range(101,1000000) ]) if int(j) > 10 and j == min_combo(fifth_power(j,exp)) ] )
and profiles like this:
$ python -m cProfile euler30.py
443839
3039223 function calls in 2.040 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1007801 1.086 0.000 1.721 0.000 euler30.py:10(min_combo)
7908 0.024 0.000 0.026 0.000 euler30.py:14(fifth_power)
1 0.279 0.279 2.040 2.040 euler30.py:6(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1007801 0.175 0.000 0.175 0.000 {method 'join' of 'str' objects}
1 0.013 0.013 0.013 0.013 {range}
1007801 0.461 0.000 0.461 0.000 {sorted}
7909 0.002 0.000 0.002 0.000 {sum}
The second way is the more usual for loop:
exp = 5
ans = 0

def min_combo(n):
    return ''.join(sorted(list(str(n))))

def fifth_power(n, exp):
    return sum([int(x) ** exp for x in list(n)])

for j in set([ ''.join(sorted(list(str(i)))) for i in range(100, 1000000) ]):
    if int(j) > 10:
        if j == min_combo(fifth_power(j,exp)):
            ans += fifth_power(j,exp)

print 'answer', ans
Here is the profiling info again:
$ python -m cProfile euler30.py
answer 443839
2039325 function calls in 1.709 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
7908 0.024 0.000 0.026 0.000 euler30.py:13(fifth_power)
1 1.081 1.081 1.709 1.709 euler30.py:6(<module>)
7902 0.009 0.000 0.015 0.000 euler30.py:9(min_combo)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1007802 0.147 0.000 0.147 0.000 {method 'join' of 'str' objects}
1 0.013 0.013 0.013 0.013 {range}
1007802 0.433 0.000 0.433 0.000 {sorted}
7908 0.002 0.000 0.002 0.000 {sum}
Why does the list comprehension implementation call min_combo() 1,000,000 more times than the for loop implementation?

Because in the second version you re-implemented the body of min_combo inline inside the set call, so those million calls are charged to the module itself rather than to min_combo.
Do the same thing in both versions and you'll get the same result.
BTW, change those to avoid big lists being created:
sum([something for foo in bar]) -> sum(something for foo in bar)
set([something for foo in bar]) -> set(something for foo in bar)
(without [...] they become generator expressions).
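For instance, here is a sketch (untested) of the comprehension version rewritten along those lines, reusing min_combo inside the set and dropping the brackets so each [...] becomes a generator expression:

exp = 5

def min_combo(n):
    return ''.join(sorted(str(n)))

def fifth_power(n, exp):
    # a string's characters can be iterated over directly
    return sum(int(x) ** exp for x in n)

# no intermediate million-element lists are materialised now
print sum(fifth_power(j, exp)
          for j in set(min_combo(i) for i in range(101, 1000000))
          if int(j) > 10 and j == min_combo(fifth_power(j, exp)))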

Numpy array calculations take too long - Bug?

NumPy version: 1.14.5
Purpose of the 'foo' function:
Finding the Euclidean distance between arrays of shape (1, 512), which represent facial features.
Issue:
The foo function takes ~223.32 ms, but after that, some background operations related to NumPy take 170 seconds for some reason.
Question:
Is keeping arrays in dictionaries and iterating over them a dangerous usage of NumPy arrays?
Request for Advice:
When I keep the arrays stacked and separate from the dict, the Euclidean distance calculation takes half the time (~120 ms instead of ~250 ms), but overall performance doesn't change much for some reason. Allocating new arrays and stacking them may have cancelled out the benefits of operating on one bigger array.
I am open to any advice.
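For reference, a minimal sketch of that stacked approach, using the names from the Code section below and assuming every stored feature keeps the (1, 512) shape:

import numpy as np

# Hypothetical: gather all (1, 512) feature vectors into one (N, 512) matrix
# so the distances to `face` come from a single vectorized call instead of
# one np.linalg.norm per dictionary entry.
stacked = np.vstack([rec[0] for rec in merged_faces_rec.values()])
dists = np.linalg.norm(stacked - face, axis=1)  # one Euclidean distance per row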
Code:
import numpy as np
import time
import uuid
import random
from funcy import print_durations

@print_durations
def foo(merged_faces_rec, face):
    t = time.time()
    for uid, feature_list in merged_faces_rec.items():
        dist = np.linalg.norm(np.subtract(feature_list[0], face))
    print("foo inside : ", time.time()-t)

rand_age = lambda: random.choice(["0-18", "18-35", "35-55", "55+"])
rand_gender = lambda: random.choice(["Erkek", "Kadin"])
rand_emo = lambda: random.choice(["happy", "sad", "neutral", "scared"])
date_list = []
emb = lambda: np.random.rand(1, 512)

def generate_faces_rec(d, n=12000):
    for _ in range(n):
        d[uuid.uuid4().hex] = [emb(), rand_gender(), rand_age(), rand_emo(), date_list]

faces_rec1 = dict()
generate_faces_rec(faces_rec1)
faces_rec2 = dict()
generate_faces_rec(faces_rec2)
faces_rec3 = dict()
generate_faces_rec(faces_rec3)
faces_rec4 = dict()
generate_faces_rec(faces_rec4)
faces_rec5 = dict()
generate_faces_rec(faces_rec5)

merged_faces_rec = dict()
st = time.time()
merged_faces_rec.update(faces_rec1)
merged_faces_rec.update(faces_rec2)
merged_faces_rec.update(faces_rec3)
merged_faces_rec.update(faces_rec4)
merged_faces_rec.update(faces_rec5)
t2 = time.time()
print("updates: ", t2-st)

face = list(merged_faces_rec.values())[0][0]
t3 = time.time()
print("face: ", t3-t2)

t4 = time.time()
foo(merged_faces_rec, face)
t5 = time.time()
print("foo: ", t5-t4)
Result:
Computations between t4 and t5 took 168 seconds.
updates: 0.00468754768371582
face: 0.0011434555053710938
foo inside : 0.2232837677001953
223.32 ms in foo({'d02d46999aa145be8116..., [[0.96475353 0.8055263...)
foo: 168.42408967018127
cProfile
python3 -m cProfile -s tottime test.py
cProfile Result:
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
30720512 44.991 0.000 85.425 0.000 arrayprint.py:888(__call__)
36791296 42.447 0.000 42.447 0.000 {built-in method numpy.core.multiarray.dragon4_positional}
30840514/60001 36.154 0.000 149.749 0.002 arrayprint.py:659(recurser)
24649728 25.967 0.000 25.967 0.000 {built-in method numpy.core.multiarray.dragon4_scientific}
30720512 20.183 0.000 26.420 0.000 arrayprint.py:636(_extendLine)
10 12.281 1.228 12.281 1.228 {method 'sub' of '_sre.SRE_Pattern' objects}
60001 11.434 0.000 79.370 0.001 arrayprint.py:804(fillFormat)
228330011/228329975 10.270 0.000 10.270 0.000 {built-in method builtins.len}
204081 4.815 0.000 16.469 0.000 {built-in method builtins.max}
18431577 4.624 0.000 21.742 0.000 arrayprint.py:854(<genexpr>)
18431577 4.453 0.000 28.627 0.000 arrayprint.py:859(<genexpr>)
30720531 3.987 0.000 3.987 0.000 {method 'split' of 'str' objects}
12348936 3.012 0.000 13.873 0.000 arrayprint.py:829(<genexpr>)
12348936 3.007 0.000 17.955 0.000 arrayprint.py:832(<genexpr>)
18431577 2.179 0.000 2.941 0.000 arrayprint.py:863(<genexpr>)
18431577 2.124 0.000 2.870 0.000 arrayprint.py:864(<genexpr>)
12348936 1.625 0.000 3.180 0.000 arrayprint.py:833(<genexpr>)
12348936 1.468 0.000 1.992 0.000 arrayprint.py:834(<genexpr>)
12348936 1.433 0.000 1.922 0.000 arrayprint.py:844(<genexpr>)
12348936 1.432 0.000 1.929 0.000 arrayprint.py:837(<genexpr>)
12324864 1.074 0.000 1.074 0.000 {method 'partition' of 'str' objects}
6845518 0.761 0.000 0.761 0.000 {method 'rstrip' of 'str' objects}
60001 0.747 0.000 80.175 0.001 arrayprint.py:777(__init__)
2 0.637 0.319 245.563 122.782 debug.py:237(smart_repr)
120002 0.573 0.000 0.573 0.000 {method 'reduce' of 'numpy.ufunc' objects}
60001 0.421 0.000 231.153 0.004 arrayprint.py:436(_array2string)
60000 0.370 0.000 0.370 0.000 {method 'rand' of 'mtrand.RandomState' objects}
60000 0.303 0.000 232.641 0.004 arrayprint.py:1334(array_repr)
60001 0.274 0.000 232.208 0.004 arrayprint.py:465(array2string)
60001 0.261 0.000 80.780 0.001 arrayprint.py:367(_get_format_function)
120008 0.255 0.000 0.611 0.000 numeric.py:2460(seterr)
Update to Clarify the Question
This is the part that has the bug. Something behind the scenes causes the program to take too long. Is it something to do with the garbage collector, or just a weird NumPy bug? I don't have a clue.
t6 = time.time()
foo1(big_array, face) # 223.32ms
t7 = time.time()
print("foo1 : ", t7-t6) # foo1 : 170 seconds

Why is sorting a python list of tuples faster when I explicitly provide the key as the first element?

Sorting a list of tuples (dictionary key/value pairs where the key is a random string) is faster when I do not explicitly specify that the key should be used (edit: added operator.itemgetter(0) from the comment by @chepner, and the key version is now faster!):
import timeit

setup = """
import random
import string
import operator
random.seed('slartibartfast')
d = {}
for i in range(1000):
    d[''.join(random.choice(string.ascii_uppercase) for _ in range(16))] = 0
"""

print min(timeit.Timer('for k,v in sorted(d.iteritems()): pass',
                       setup=setup).repeat(7, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(), key=lambda x: x[0]): pass',
                       setup=setup).repeat(7, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(), key=operator.itemgetter(0)): pass',
                       setup=setup).repeat(7, 1000))
Gives:
0.575334150664
0.579534521128
0.523808984422 (the itemgetter version!)
If, however, I fill the dictionary with a custom object, passing key=lambda x: x[0] explicitly to sorted makes it faster:
setup = """
import random
import string
import operator
random.seed('slartibartfast')
d = {}

class A(object):
    def __init__(self):
        self.s = ''.join(random.choice(string.ascii_uppercase) for _ in range(16))
    def __hash__(self): return hash(self.s)
    def __eq__(self, other):
        return self.s == other.s
    def __ne__(self, other): return self.s != other.s
    # def __cmp__(self, other): return cmp(self.s, other.s)

for i in range(1000):
    d[A()] = 0
"""

print min(timeit.Timer('for k,v in sorted(d.iteritems()): pass',
                       setup=setup).repeat(3, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(), key=lambda x: x[0]): pass',
                       setup=setup).repeat(3, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(), key=operator.itemgetter(0)): pass',
                       setup=setup).repeat(3, 1000))
Gives:
4.65625458083
1.87191002252
1.78853626684
Is this expected? It seems the second element of the tuple is used in the second case, but shouldn't the keys compare unequal?
Note: uncommenting the comparison method gives worse results all around, but the key versions are still substantially faster:
8.11941771831
5.29207000173
5.25420037046
As expected, the built-in comparison (by address) is faster.
EDIT: here are the profiling results from my original code that triggered the question - without the key method:
12739 function calls in 0.007 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.007 0.007 <string>:1(<module>)
1 0.000 0.000 0.007 0.007 __init__.py:6527(_refreshOrder)
1 0.002 0.002 0.006 0.006 {sorted}
4050 0.003 0.000 0.004 0.000 bolt.py:1040(__cmp__) # here is the custom object
4050 0.001 0.000 0.001 0.000 {cmp}
4050 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 {method 'sort' of 'list' objects}
291 0.000 0.000 0.000 0.000 __init__.py:6537(<lambda>)
291 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 bolt.py:1240(iteritems)
1 0.000 0.000 0.000 0.000 {method 'iteritems' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
and here are the results when I specify the key:
7027 function calls in 0.004 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.004 0.004 <string>:1(<module>)
1 0.000 0.000 0.004 0.004 __init__.py:6527(_refreshOrder)
1 0.001 0.001 0.003 0.003 {sorted}
2049 0.001 0.000 0.002 0.000 bolt.py:1040(__cmp__)
2049 0.000 0.000 0.000 0.000 {cmp}
2049 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 {method 'sort' of 'list' objects}
291 0.000 0.000 0.000 0.000 __init__.py:6538(<lambda>)
291 0.000 0.000 0.000 0.000 __init__.py:6533(<lambda>)
291 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 bolt.py:1240(iteritems)
1 0.000 0.000 0.000 0.000 {method 'iteritems' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Apparently it is the __cmp__ and not the __eq__ method that is called (edit: because that class defines __cmp__ but not __eq__; see here for the order of resolution of equals and compare).
In the code here the __eq__ method is indeed called (8605 times), as seen by adding debug prints (see the comments).
So the difference is as stated in the answer by @chepner. The last thing I am not quite clear on is why those tuple equality calls are needed (in other words, why __eq__ is called instead of calling __cmp__ directly).
FINAL EDIT: I asked this last point here: Why in comparing python tuples of objects is __eq__ and then __cmp__ called? - it turns out it's an optimization: tuple comparison calls __eq__ on the tuple elements, and only calls __cmp__ for elements that are not equal. So this is now perfectly clear. I had thought it called __cmp__ directly, so initially it seemed to me that specifying the key was simply unneeded, and after chepner's answer I was still not getting where the equality calls came in.
Gist: https://gist.github.com/Utumno/f3d25e0fe4bd0f43ceb9178a60181a53
There are two issues at play.
Comparing two values of builtin types (such as int) happens in C. Comparing two values of a class with an __eq__ method happens in Python; repeatedly calling __eq__ imposes a significant performance penalty.
The function passed with key is called once per element, rather than once per comparison. This means that lambda x: x[0] is called once per element, to build a list of A instances to be used as the sort keys. Without key, you need to make O(n lg n) tuple comparisons, each of which requires a call to A.__eq__ to compare the first element of each tuple.
The first explains why your first pair of results is under a second while the second takes several seconds. The second explains why using key is faster regardless of the values being compared.
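To make that concrete, here is a small illustrative sketch (the Tracked class, first() and the counters are mine, not from the question). The key function runs exactly once per element, while the comparison method runs roughly n log n times:

import random

calls = {'lt': 0, 'key': 0}

class Tracked(object):
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        calls['lt'] += 1          # one call per comparison
        return self.v < other.v

items = [Tracked(random.random()) for _ in range(1000)]

sorted(items)                     # comparison-based sort, calls __lt__
print "without key: %d __lt__ calls" % calls['lt']   # roughly n log n

def first(x):
    calls['key'] += 1             # one call per element
    return x.v

sorted(items, key=first)          # the float keys are then compared in C
print "with key: %d key calls" % calls['key']        # exactly 1000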

Explain {isinstance} in IPython prun output?

I'm trying to profile a few lines of Pandas code, and when I run %prun I'm finding most of my time is taken by {isinstance}. This seems to happen a lot -- can anyone suggest what that means and, for bonus points, suggest a way to avoid it?
This isn't meant to be application specific, but here's a thinned out version of the code if that's important:
def flagOtherGroup(df):
    try: mostUsed0 = df[df.subGroupDummy == 0].siteid.iloc[0]
    except: mostUsed0 = -1
    try: mostUsed1 = df[df.subGroupDummy == 1].siteid.iloc[0]
    except: mostUsed1 = -1
    df['mostUsed'] = 0
    df.loc[(df.subGroupDummy == 0) & (df.siteid == mostUsed1), 'mostUsed'] = 1
    df.loc[(df.subGroupDummy == 1) & (df.siteid == mostUsed0), 'mostUsed'] = 1
    return df[['mostUsed']]
%prun -l15 temp = test.groupby('userCode').apply(flagOtherGroup)
And top lines of prun:
Ordered by: internal time
List reduced from 531 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
834472 1.908 0.000 2.280 0.000 {isinstance}
497048/395400 1.192 0.000 1.572 0.000 {len}
32722 0.879 0.000 4.479 0.000 series.py:114(__init__)
34444 0.613 0.000 1.792 0.000 internals.py:3286(__init__)
25990 0.568 0.000 0.568 0.000 {method 'reduce' of 'numpy.ufunc' objects}
82266/78821 0.549 0.000 0.744 0.000 {numpy.core.multiarray.array}
42201 0.544 0.000 1.195 0.000 internals.py:62(__init__)
42201 0.485 0.000 1.812 0.000 internals.py:2015(make_block)
166244 0.476 0.000 0.615 0.000 {getattr}
4310 0.455 0.000 1.121 0.000 internals.py:2217(_rebuild_blknos_and_blklocs)
12054 0.417 0.000 2.134 0.000 internals.py:2355(apply)
9474 0.385 0.000 1.284 0.000 common.py:727(take_nd)
isinstance, len and getattr are just the built-in functions. There is a huge number of calls to the isinstance() function here; it is not that a single call takes a lot of time, but that the function was called 834,472 times.
Presumably it is the pandas code that calls it.
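If you would rather confirm than presume, you can save the stats to a file and ask for a callers breakdown; a sketch along these lines (the 'flag.prof' filename is arbitrary):

import cProfile
import pstats

cProfile.run("temp = test.groupby('userCode').apply(flagOtherGroup)", 'flag.prof')
pstats.Stats('flag.prof').print_callers('isinstance')  # shows who calls isinstance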

Why is that version of mergesort faster?

Based on that answer, here are two versions of the merge function used for mergesort.
Could you help me understand why the second one is much faster?
I have tested it on a list of 50,000 elements and the second one is 8 times faster (Gist).
def merge1(left, right):
    i = j = inv = 0
    merged = []
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inv += len(left[i:])
    merged += left[i:]
    merged += right[j:]
    return merged, inv
def merge2(array1, array2):
    inv = 0
    merged_array = []
    while array1 or array2:
        if not array1:
            merged_array.append(array2.pop())
        elif (not array2) or array1[-1] > array2[-1]:
            merged_array.append(array1.pop())
            inv += len(array2)
        else:
            merged_array.append(array2.pop())
    merged_array.reverse()
    return merged_array, inv
Here is the sort function:
def _merge_sort(list, merge):
    len_list = len(list)
    if len_list < 2:
        return list, 0
    middle = len_list / 2
    left, left_inv = _merge_sort(list[:middle], merge)
    right, right_inv = _merge_sort(list[middle:], merge)
    l, merge_inv = merge(left, right)
    inv = left_inv + right_inv + merge_inv
    return l, inv
import numpy.random as nprnd

test_list = nprnd.randint(1000, size=50000).tolist()

test_list_tmp = list(test_list)
_merge_sort(test_list_tmp, merge1)

test_list_tmp = list(test_list)
_merge_sort(test_list_tmp, merge2)
A similar answer to kreativitea's above, but with more info (I think!).
So, profiling the actual merge functions for the merging of two 50K arrays:
merge1
311748 function calls in 15.363 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 15.363 15.363 <string>:1(<module>)
1 15.322 15.322 15.362 15.362 merge.py:3(merge1)
221309 0.030 0.000 0.030 0.000 {len}
90436 0.010 0.000 0.010 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
merge2
250004 function calls in 0.104 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 0.104 0.104 <string>:1(<module>)
1 0.074 0.074 0.103 0.103 merge.py:20(merge2)
50000 0.005 0.000 0.005 0.000 {len}
100000 0.010 0.000 0.010 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100000 0.014 0.000 0.014 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'reverse' of 'list' objects}
So for merge1, it's 221,309 len calls and 90,436 append calls, taking 15.363 seconds.
For merge2, it's 50,000 len, 100,000 append and 100,000 pop calls, taking 0.104 seconds.
len, append and pop are all O(1) (more info here), so these profiles aren't showing what's actually taking the time; going off the call counts alone, merge1 should be only ~20% slower, not ~150x.
Okay, the cause is actually fairly obvious if you just read the code.
In the first method, there is this line:
inv += len(left[i:])
so every time that line runs, it has to build a new list for the slice left[i:]. If you comment the line out (or just replace it with inv += 1 or something) then merge1 becomes faster than the other method. This is the single line responsible for the increased time.
Having noticed this is the cause, the issue can be fixed by improving the code: change the line to
inv += len(left) - i
for a speed-up. After doing this, merge1 will be faster than merge2.
Update it to this:
def merge3(left, right):
    i = j = inv = 0
    merged = []
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inv += len(left) - i
    merged += left[i:]
    merged += right[j:]
    return merged, inv
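As a quick, hypothetical check of why that slice was so expensive: len(left[i:]) copies the tail of the list just to count it, while len(left) - i is plain arithmetic:

import timeit

setup = "left = list(range(50000)); i = 25000"
# the slice copies ~25,000 elements on every call
print timeit.timeit('len(left[i:])', setup=setup, number=10000)
# the subtraction is O(1)
print timeit.timeit('len(left) - i', setup=setup, number=10000)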
You can use the excellent cProfile module to help you solve things like this.
>>> import cProfile
>>> a = range(1,20000,2)
>>> b = range(0,20000,2)
>>> cProfile.run('merge1(a, b)')
70002 function calls in 0.195 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.184 0.184 0.195 0.195 <pyshell#7>:1(merge1)
1 0.000 0.000 0.195 0.195 <string>:1(<module>)
50000 0.008 0.000 0.008 0.000 {len}
19999 0.003 0.000 0.003 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
>>> cProfile.run('merge2(a, b)')
50004 function calls in 0.026 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.016 0.016 0.026 0.026 <pyshell#12>:1(merge2)
1 0.000 0.000 0.026 0.026 <string>:1(<module>)
10000 0.002 0.000 0.002 0.000 {len}
20000 0.003 0.000 0.003 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
20000 0.005 0.000 0.005 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'reverse' of 'list' objects}
After looking at the information a bit, it looks like the commenters are correct: it's not the len function. The big block of time is the one charged to the function body itself (the <pyshell#7>:1(merge1) and <string>:1(<module>) rows), which covers work such as comparisons and slicing that never appears as a separate call. That row shows up even for a trivial statement such as a length comparison:
>>> cProfile.run('0 < len(c)')
3 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Slicing a list is charged the same way; a single slice is fairly quick, but its cost is proportional to the slice length, so inside a loop it adds up:
>>> len(c)
20000000
>>> cProfile.run('c[3:2000000]')
2 function calls in 0.011 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.011 0.011 0.011 0.011 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
TL;DR: The time charged to the function body itself - 0.195s in your first function and 0.026s in your second - is, apparently, the rebuilding of the list in the line inv += len(left[i:]).
If I had to guess, I would say it probably has to do with the cost of removing elements from a list: removing from the end (pop) is quicker than removing from the beginning, and the second version only ever removes elements from the end of the list.
See Performance Notes: http://effbot.org/zone/python-list.htm
"The time needed to remove an item is about the same as the time needed to insert an item at the same location; removing items at the end is fast, removing items at the beginning is slow."

Increasing the depth of cProfiler in Python to report more functions?

I'm trying to profile a function that calls other functions. I call the profiler as follows:
from mymodule import foo

def start():
    # ...
    foo()

import pstats
import cProfile as profile

profile.run('start()', output_file)
p = pstats.Stats(output_file)
print "name: "
print p.sort_stats('name')
print "all stats: "
p.print_stats()
print "cumulative (top 10): "
p.sort_stats('cumulative').print_stats(10)
I find that the profiler says all the time was spent in function foo() of mymodule, instead of breaking it down into the subfunctions foo() calls, which is what I want to see. How can I make the profiler report the performance of these functions?
Thanks.
You need p.print_callees() to get a hierarchical breakdown of method calls. The output is quite self-explanatory: in the left column you find your function of interest, e.g. foo(); the columns to the right show all the called sub-functions and their scoped total and cumulative times. Breakdowns for these sub-calls are included as well, etc.
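For example, a sketch following the Asker's own snippet (output_file is the stats file written by profile.run above):

import pstats

p = pstats.Stats(output_file)
p.print_callees('foo')   # break down everything foo() calls
p.print_callers('foo')   # and, conversely, everything that calls foo()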
First, I want to say that I was unable to replicate the Asker's issue. The profiler (in py2.7) definitely descends into the called functions and methods. (The docs for py3.6 look identical, but I haven't tested on py3.) My guess is that by limiting it to the top 10 returns, sorted by cumulative time, the first N of those were very high-level functions called a minimal number of times, and the functions called by foo() dropped off the bottom of the list.
I decided to play with some big numbers for testing. Here's my test code:
# file: mymodule.py
import math

def foo(n=5):
    for i in xrange(1, n):
        baz(i)
        bar(i ** i)

def bar(n):
    for i in xrange(1, n):
        e = exp200(i)
        print "len e: ", len("{}".format(e))

def exp200(n):
    result = 1
    for i in xrange(200):
        result *= n
    return result

def baz(n):
    print "{}".format(n)
And the including file (very similar to the Asker's):
# file: test.py
from mymodule import foo

def start():
    # ...
    foo(8)

OUTPUT_FILE = 'test.profile_info'

import pstats
import cProfile as profile

profile.run('start()', OUTPUT_FILE)
p = pstats.Stats(OUTPUT_FILE)
print "name: "
print p.sort_stats('name')
print "all stats: "
p.print_stats()
print "cumulative (top 10): "
p.sort_stats('cumulative').print_stats(10)
print "time (top 10): "
p.sort_stats('time').print_stats(10)
Notice the last line. I added a view sorted by time, which is the total time spent in the function "excluding time made in calls to sub-functions". I find this view much more useful, as it tends to favor the functions that are doing actual work, and may be in need of optimization.
Here's the part of the results that the Asker was working from (cumulative-sorted):
cumulative (top 10):
Thu Mar 24 21:26:32 2016 test.profile_info
2620840 function calls in 76.039 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 76.039 76.039 <string>:1(<module>)
1 0.000 0.000 76.039 76.039 test.py:5(start)
1 0.000 0.000 76.039 76.039 /Users/jhazen/mymodule.py:4(foo)
7 10.784 1.541 76.039 10.863 /Users/jhazen/mymodule.py:10(bar)
873605 49.503 0.000 49.503 0.000 /Users/jhazen/mymodule.py:15(exp200)
873612 15.634 0.000 15.634 0.000 {method 'format' of 'str' objects}
873605 0.118 0.000 0.118 0.000 {len}
7 0.000 0.000 0.000 0.000 /Users/jhazen/mymodule.py:21(baz)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
See how the top 3 functions in this display were only called once. Let's look at the time-sorted view:
time (top 10):
Thu Mar 24 21:26:32 2016 test.profile_info
2620840 function calls in 76.039 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
873605 49.503 0.000 49.503 0.000 /Users/jhazen/mymodule.py:15(exp200)
873612 15.634 0.000 15.634 0.000 {method 'format' of 'str' objects}
7 10.784 1.541 76.039 10.863 /Users/jhazen/mymodule.py:10(bar)
873605 0.118 0.000 0.118 0.000 {len}
7 0.000 0.000 0.000 0.000 /Users/jhazen/mymodule.py:21(baz)
1 0.000 0.000 76.039 76.039 /Users/jhazen/mymodule.py:4(foo)
1 0.000 0.000 76.039 76.039 test.py:5(start)
1 0.000 0.000 76.039 76.039 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Now the number one entry makes sense. Obviously raising something to the 200th power by repeated multiplication is a "naive" strategy. Let's replace it:
def exp200(n):
    return n ** 200
And the results:
time (top 10):
Thu Mar 24 21:32:18 2016 test.profile_info
2620840 function calls in 30.646 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
873612 15.722 0.000 15.722 0.000 {method 'format' of 'str' objects}
7 9.760 1.394 30.646 4.378 /Users/jhazen/mymodule.py:10(bar)
873605 5.056 0.000 5.056 0.000 /Users/jhazen/mymodule.py:15(exp200)
873605 0.108 0.000 0.108 0.000 {len}
7 0.000 0.000 0.000 0.000 /Users/jhazen/mymodule.py:18(baz)
1 0.000 0.000 30.646 30.646 /Users/jhazen/mymodule.py:4(foo)
1 0.000 0.000 30.646 30.646 test.py:5(start)
1 0.000 0.000 30.646 30.646 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
That's a nice improvement in time. Now str.format() is our worst offender. I added the line in bar() to print the length of the number, because my first attempt (just computing the number and doing nothing with it) got optimized away, and my attempt to avoid that (printing the number, which got really big really fast) seemed like it might be blocking on I/O, so I compromised on printing the length of the number. Hey, that's the base-10 log. Let's try that:
def bar(n):
    for i in xrange(1, n):
        e = exp200(i)
        print "log e: ", math.log10(e)
And the results:
time (top 10):
Thu Mar 24 21:40:16 2016 test.profile_info
1747235 function calls in 11.279 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
7 6.082 0.869 11.279 1.611 /Users/jhazen/mymodule.py:10(bar)
873605 4.996 0.000 4.996 0.000 /Users/jhazen/mymodule.py:15(exp200)
873605 0.201 0.000 0.201 0.000 {math.log10}
7 0.000 0.000 0.000 0.000 /Users/jhazen/mymodule.py:18(baz)
1 0.000 0.000 11.279 11.279 /Users/jhazen/mymodule.py:4(foo)
7 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}
1 0.000 0.000 11.279 11.279 test.py:5(start)
1 0.000 0.000 11.279 11.279 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Hmm, still a fair amount of time spent in bar(), even without the str.format(). Let's get rid of that print:
def bar(n):
    z = 0
    for i in xrange(1, n):
        e = exp200(i)
        z += math.log10(e)
    return z
And the results:
time (top 10):
Thu Mar 24 21:45:24 2016 test.profile_info
1747235 function calls in 5.031 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
873605 4.487 0.000 4.487 0.000 /Users/jhazen/mymodule.py:17(exp200)
7 0.440 0.063 5.031 0.719 /Users/jhazen/mymodule.py:10(bar)
873605 0.104 0.000 0.104 0.000 {math.log10}
7 0.000 0.000 0.000 0.000 /Users/jhazen/mymodule.py:20(baz)
1 0.000 0.000 5.031 5.031 /Users/jhazen/mymodule.py:4(foo)
7 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}
1 0.000 0.000 5.031 5.031 test.py:5(start)
1 0.000 0.000 5.031 5.031 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Now it looks like the stuff doing the actual work is the busiest function, so I think we're done optimizing.
Hope that helps!
Maybe you've faced a similar problem, so I'm going to describe my issue here. My profiling code looked like this:
def foobar():
    import cProfile
    pr = cProfile.Profile()
    pr.enable()
    for event in reader.events():
        baz()
        # and other things
    pr.disable()
    pr.dump_stats('result.prof')
And the final profiling output contained only the events() call. It took me quite a while to realise that I had been profiling an empty loop. Of course, there was more than one call of foobar() from the client code, but the meaningful profiling results had been overwritten by the last call, whose loop happened to be empty.
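One way around that, as a sketch (assuming foobar() really is called many times by the client code): share a single Profile object across the calls, so the stats accumulate instead of being overwritten, and dump them once at the end.

import cProfile
import pstats

pr = cProfile.Profile()   # one profiler shared by every call

def foobar():
    pr.enable()
    for event in reader.events():
        baz()
        # and other things
    pr.disable()

# ... after the client code has made all of its foobar() calls:
pr.dump_stats('result.prof')            # now covers every call, not just the last
pstats.Stats('result.prof').print_stats(10)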
