NumPy version: 1.14.5
Purpose of the 'foo' function:
Finding the Euclid distance between arrays with the shapes (1,512), which represent facial features.
Issue:
foo function takes ~223.32 ms , but after that, some background operations related to NumPy take 170 seconds for some reason
Question:
Is keeping arrays in dictionaries, and iterating over them is a very dangerous usage of NumPy arrays?
Request for an Advice:
When I keep the arrays stacked and separate from dict, Euclid distance calculation takes half the time (~120ms instead of ~250ms), but overall performance doesn't change much for some reason. Allocating new arrays and stacking them may have countered the benefits of bigger array calculations.
I am open to any advice.
Code:
import numpy as np
import time
import uuid
import random
from funcy import print_durations
#print_durations
def foo(merged_faces_rec, face):
t = time.time()
for uid, feature_list in merged_faces_rec.items():
dist = np.linalg.norm( np.subtract(feature_list[0], face))
print("foo inside : ", time.time()-t)
rand_age = lambda : random.choice(["0-18", "18-35", "35-55", "55+"])
rand_gender = lambda : random.choice(["Erkek", "Kadin"])
rand_emo = lambda : random.choice(["happy", "sad", "neutral", "scared"])
date_list = []
emb = lambda : np.random.rand(1, 512)
def generate_faces_rec(d, n=12000):
for _ in range(n):
d[uuid.uuid4().hex] = [emb(), rand_gender(), rand_age(), rand_emo(), date_list]
faces_rec1 = dict()
generate_faces_rec(faces_rec1)
faces_rec2 = dict()
generate_faces_rec(faces_rec2)
faces_rec3 = dict()
generate_faces_rec(faces_rec3)
faces_rec4 = dict()
generate_faces_rec(faces_rec4)
faces_rec5 = dict()
generate_faces_rec(faces_rec5)
merged_faces_rec = dict()
st = time.time()
merged_faces_rec.update(faces_rec1)
merged_faces_rec.update(faces_rec2)
merged_faces_rec.update(faces_rec3)
merged_faces_rec.update(faces_rec4)
merged_faces_rec.update(faces_rec5)
t2 = time.time()
print("updates: ", t2-st)
face = list(merged_faces_rec.values())[0][0]
t3 = time.time()
print("face: ", t3-t2)
t4 = time.time()
foo(merged_faces_rec, face)
t5 = time.time()
print("foo: ", t5-t4)
Result:
Computations between t4 and t5 took 168 seconds.
updates: 0.00468754768371582
face: 0.0011434555053710938
foo inside : 0.2232837677001953
223.32 ms in foo({'d02d46999aa145be8116..., [[0.96475353 0.8055263...)
foo: 168.42408967018127
cProfile
python3 -m cProfile -s tottime test.py
cProfile Result:
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
30720512 44.991 0.000 85.425 0.000 arrayprint.py:888(__call__)
36791296 42.447 0.000 42.447 0.000 {built-in method numpy.core.multiarray.dragon4_positional}
30840514/60001 36.154 0.000 149.749 0.002 arrayprint.py:659(recurser)
24649728 25.967 0.000 25.967 0.000 {built-in method numpy.core.multiarray.dragon4_scientific}
30720512 20.183 0.000 26.420 0.000 arrayprint.py:636(_extendLine)
10 12.281 1.228 12.281 1.228 {method 'sub' of '_sre.SRE_Pattern' objects}
60001 11.434 0.000 79.370 0.001 arrayprint.py:804(fillFormat)
228330011/228329975 10.270 0.000 10.270 0.000 {built-in method builtins.len}
204081 4.815 0.000 16.469 0.000 {built-in method builtins.max}
18431577 4.624 0.000 21.742 0.000 arrayprint.py:854(<genexpr>)
18431577 4.453 0.000 28.627 0.000 arrayprint.py:859(<genexpr>)
30720531 3.987 0.000 3.987 0.000 {method 'split' of 'str' objects}
12348936 3.012 0.000 13.873 0.000 arrayprint.py:829(<genexpr>)
12348936 3.007 0.000 17.955 0.000 arrayprint.py:832(<genexpr>)
18431577 2.179 0.000 2.941 0.000 arrayprint.py:863(<genexpr>)
18431577 2.124 0.000 2.870 0.000 arrayprint.py:864(<genexpr>)
12348936 1.625 0.000 3.180 0.000 arrayprint.py:833(<genexpr>)
12348936 1.468 0.000 1.992 0.000 arrayprint.py:834(<genexpr>)
12348936 1.433 0.000 1.922 0.000 arrayprint.py:844(<genexpr>)
12348936 1.432 0.000 1.929 0.000 arrayprint.py:837(<genexpr>)
12324864 1.074 0.000 1.074 0.000 {method 'partition' of 'str' objects}
6845518 0.761 0.000 0.761 0.000 {method 'rstrip' of 'str' objects}
60001 0.747 0.000 80.175 0.001 arrayprint.py:777(__init__)
2 0.637 0.319 245.563 122.782 debug.py:237(smart_repr)
120002 0.573 0.000 0.573 0.000 {method 'reduce' of 'numpy.ufunc' objects}
60001 0.421 0.000 231.153 0.004 arrayprint.py:436(_array2string)
60000 0.370 0.000 0.370 0.000 {method 'rand' of 'mtrand.RandomState' objects}
60000 0.303 0.000 232.641 0.004 arrayprint.py:1334(array_repr)
60001 0.274 0.000 232.208 0.004 arrayprint.py:465(array2string)
60001 0.261 0.000 80.780 0.001 arrayprint.py:367(_get_format_function)
120008 0.255 0.000 0.611 0.000 numeric.py:2460(seterr)
Update to Clearify the Question
This is the part that has the bug. Something behind the scenes causes to program to take too long. Is it something to do with garbage collector, or just weird numpy bug? I don't have any clue.
t6 = time.time()
foo1(big_array, face) # 223.32ms
t7 = time.time()
print("foo1 : ", t7-t6) # foo1 : 170 seconds
Related
Using numpy.reshape helped a lot and using map helped a little. Is it possible to speed this up some more?
import pydicom
import numpy as np
import cProfile
import pstats
def parse_coords(contour):
"""Given a contour from a DICOM ROIContourSequence, returns coordinates
[loop][[x0, x1, x2, ...][y0, y1, y2, ...][z0, z1, z2, ...]]"""
if not hasattr(contour, "ContourSequence"):
return [] # empty structure
def _reshape_contour_data(loop):
return np.reshape(np.array(loop.ContourData),
(3, len(loop.ContourData) // 3),
order='F')
return list(map(_reshape_contour_data,contour.ContourSequence))
def profile_load_contours():
rs = pydicom.dcmread('RS.gyn1.dcm')
structs = [parse_coords(contour) for contour in rs.ROIContourSequence]
cProfile.run('profile_load_contours()','prof.stats')
p = pstats.Stats('prof.stats')
p.sort_stats('cumulative').print_stats(30)
Using a real structure set exported from Varian Eclipse.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 12.165 12.165 {built-in method builtins.exec}
1 0.151 0.151 12.165 12.165 <string>:1(<module>)
1 0.000 0.000 12.014 12.014 load_contour_time.py:19(profile_load_contours)
1 0.000 0.000 11.983 11.983 load_contour_time.py:21(<listcomp>)
56 0.009 0.000 11.983 0.214 load_contour_time.py:7(parse_coords)
50745/33837 0.129 0.000 11.422 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:455(__getattr__)
50741/33825 0.152 0.000 10.938 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:496(__getitem__)
16864 0.069 0.000 9.839 0.001 load_contour_time.py:12(_reshape_contour_data)
16915 0.101 0.000 9.780 0.001 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataelem.py:439(DataElement_from_raw)
16915 0.052 0.000 9.300 0.001 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/values.py:320(convert_value)
16864 0.038 0.000 7.099 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/values.py:89(convert_DS_string)
16870 0.042 0.000 7.010 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/valuerep.py:495(MultiString)
16908 1.013 0.000 6.826 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/multival.py:29(__init__)
3004437 3.013 0.000 5.577 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/multival.py:42(number_string_type_constructor)
3038317/3038231 1.037 0.000 3.171 0.000 {built-in method builtins.hasattr}
Much of the time is in convert_DS_string. Is it possible to make it faster? I guess part of the problem is that the coordinates are not stored very efficiently in the DICOM file.
EDIT:
As a way of avoiding the loop at the end of MultiVal.__init__ I am wondering about getting the raw double string of each ContourData and using numpy.fromstring on it. However, I have not been able to get the raw double string.
Eliminating the loop in MultiVal.__init__ and using numpy.fromstring provides more than 4 times speedup. I will post on the pydicom github see if there is some interest in taking this into the library code. It is a little ugly. I would welcome advice on further improvement.
import pydicom
import numpy as np
import cProfile
import pstats
def parse_coords(contour):
"""Given a contour from a DICOM ROIContourSequence, returns coordinates
[loop][[x0, x1, x2, ...][y0, y1, y2, ...][z0, z1, z2, ...]]"""
if not hasattr(contour, "ContourSequence"):
return [] # empty structure
cd_tag = pydicom.tag.Tag(0x3006, 0x0050) # ContourData tag
def _reshape_contour_data(loop):
val = super(loop.__class__, loop).__getitem__(cd_tag).value
try:
double_string = val.decode(encoding='utf-8')
double_vec = np.fromstring(double_string, dtype=float, sep=chr(92)) # 92 is '/'
except AttributeError: # 'MultiValue' has no 'decode' (bytes does)
# It's already been converted to doubles and cached
double_vec = loop.ContourData
return np.reshape(np.array(double_vec),
(3, len(double_vec) // 3),
order='F')
return list(map(_reshape_contour_data, contour.ContourSequence))
def profile_load_contours():
rs = pydicom.dcmread('RS.gyn1.dcm')
structs = [parse_coords(contour) for contour in rs.ROIContourSequence]
profile_load_contours()
cProfile.run('profile_load_contours()','prof.stats')
p = pstats.Stats('prof.stats')
p.sort_stats('cumulative').print_stats(15)
Result
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 2.800 2.800 {built-in method builtins.exec}
1 0.017 0.017 2.800 2.800 <string>:1(<module>)
1 0.000 0.000 2.783 2.783 load_contour_time3.py:29(profile_load_contours)
1 0.000 0.000 2.761 2.761 load_contour_time3.py:31(<listcomp>)
56 0.006 0.000 2.760 0.049 load_contour_time3.py:9(parse_coords)
153/109 0.001 0.000 2.184 0.020 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:455(__getattr__)
149/97 0.001 0.000 2.182 0.022 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:496(__getitem__)
51 0.000 0.000 2.178 0.043 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataelem.py:439(DataElement_from_raw)
51 0.000 0.000 2.177 0.043 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/values.py:320(convert_value)
44 0.000 0.000 2.176 0.049 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/values.py:255(convert_SQ)
44 0.035 0.001 2.176 0.049 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/filereader.py:427(read_sequence)
152/66 0.000 0.000 2.171 0.033 {built-in method builtins.hasattr}
16920 0.147 0.000 1.993 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/filereader.py:452(read_sequence_item)
16923 0.116 0.000 1.267 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/filereader.py:365(read_dataset)
84616 0.113 0.000 0.699 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:960(__setattr__)
Here is the rate limiting function in my code
def timepropagate(wv1, ham11,
ham12, ham22, scalararray, nt):
wv2 = np.zeros((nx, ny), 'c16')
fw1 = np.zeros((nx, ny), 'c16')
fw2 = np.zeros((nx, ny), 'c16')
for t in range(0, nt, 1):
wv1, wv2 = scalararray*wv1, scalararray*wv2
fw1, fw2 = (np.fft.fft2(wv1), np.fft.fft2(wv2))
fw1 = ham11*fw1+ham12*fw2
fw2 = ham12*fw1+ham22*fw2
wv1, wv2 = (np.fft.ifft2(fw1), np.fft.ifft2(fw2))
wv1, wv2 = scalararray*wv1, scalararray*wv2
del(fw1)
del(fw2)
return np.array([wv1, wv2])
What I would need to do is find a reasonably fast implementation that would allow me to go at twice the speed, preferably the fastest.
The more general question I'm interested in, is what way can I speed up this piece, using minimal possible connections back to python. I assume that even if I speed up specific segments of the code, say the scalar array multiplications, I would still come back and go from python at the Fourier transforms which would take time. Are there any ways I can use, say numba or cython and not make this "coming back" to python in the middle of the loops?
On a personal note, I'd prefer something fast on a single thread considering that I'd be using my other threads already.
Edit: here are results of profiling, the 1st one for 4096x4096 arrays for 10 time steps, I need to scale it up for nt = 8000.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.099 0.099 432.556 432.556 <string>:1(<module>)
40 0.031 0.001 28.792 0.720 fftpack.py:100(fft)
40 45.867 1.147 68.055 1.701 fftpack.py:195(ifft)
80 0.236 0.003 47.647 0.596 fftpack.py:46(_raw_fft)
40 0.102 0.003 1.260 0.032 fftpack.py:598(_cook_nd_args)
40 1.615 0.040 99.774 2.494 fftpack.py:617(_raw_fftnd)
20 0.225 0.011 29.739 1.487 fftpack.py:819(fft2)
20 2.252 0.113 72.512 3.626 fftpack.py:908(ifft2)
80 0.000 0.000 0.000 0.000 fftpack.py:93(_unitary)
40 0.631 0.016 0.820 0.021 fromnumeric.py:43(_wrapit)
80 0.009 0.000 0.009 0.000 fromnumeric.py:457(swapaxes)
40 0.338 0.008 1.158 0.029 fromnumeric.py:56(take)
200 0.064 0.000 0.219 0.001 numeric.py:414(asarray)
1 329.728 329.728 432.458 432.458 profiling.py:86(timepropagate)
1 0.036 0.036 432.592 432.592 {built-in method builtins.exec}
40 0.001 0.000 0.001 0.000 {built-in method builtins.getattr}
120 0.000 0.000 0.000 0.000 {built-in method builtins.len}
241 3.930 0.016 3.930 0.016 {built-in method numpy.core.multiarray.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.zeros}
40 18.861 0.472 18.861 0.472 {built-in method numpy.fft.fftpack_lite.cfftb}
40 28.539 0.713 28.539 0.713 {built-in method numpy.fft.fftpack_lite.cfftf}
1 0.000 0.000 0.000 0.000 {built-in method numpy.fft.fftpack_lite.cffti}
80 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
40 0.006 0.000 0.006 0.000 {method 'astype' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
80 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
40 0.000 0.000 0.000 0.000 {method 'reverse' of 'list' objects}
80 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}
80 0.001 0.000 0.001 0.000 {method 'swapaxes' of 'numpy.ndarray' objects}
40 0.022 0.001 0.022 0.001 {method 'take' of 'numpy.ndarray' objects}
I think I've done it wrong the first time, using time.time() to calculate time differences for small arrays and extrapolating the conclusions for larger ones.
If most of the time is spent in the hamiltonian multiplication, you may want to apply numba on that part. The most benefit coming from removing all the temporal arrays that would be needed if evaluating expressions from within NumPy.
Bear also in mind that the arrays (4096, 4096, c16) are big enough to not fit comfortably in the processor caches. A single matrix would take 256 MiB. So think that the performance is unlikely to be related at all with the operations, but rather on the bandwidth. So implement those operations in a way that you only perform one pass in the input operands. This is really trivial to implement in numba. Note: You will only need to implement in numba the hamiltonian expressions.
I want also to point out that the "preallocations" using np.zeros seems to signal that your code is not following your intent as:
fw1 = ham11*fw1+ham12*fw2
fw2 = ham12*fw1+ham22*fw2
will actually create new arrays for fw1, fw2. If your intent was to reuse the buffer, you may want to use "fw1[:,:] = ...". Otherwise the np.zeros do nothing but waste time and memory.
You may want to consider to join (wv1, wv2) into a (2, 4096, 4096, c16) array. The same with (fw1, fw2). That way code will be simpler as you can rely on broadcasting to handle the "scalararray" product. fft2 and ifft2 will actually do the right thing (AFAIK).
I'm trying to profile a few lines of Pandas code, and when I run %prun i'm finding most of my time is taken by {isinstance}. This seems to happen a lot -- can anyone suggest what that means and, for bonus points, suggest a way to avoid it?
This isn't meant to be application specific, but here's a thinned out version of the code if that's important:
def flagOtherGroup(df):
try:mostUsed0 = df[df.subGroupDummy == 0].siteid.iloc[0]
except: mostUsed0 = -1
try: mostUsed1 = df[df.subGroupDummy == 1].siteid.iloc[0]
except: mostUsed1 = -1
df['mostUsed'] = 0
df.loc[(df.subGroupDummy == 0) & (df.siteid == mostUsed1), 'mostUsed'] = 1
df.loc[(df.subGroupDummy == 1) & (df.siteid == mostUsed0), 'mostUsed'] = 1
return df[['mostUsed']]
%prun -l15 temp = test.groupby('userCode').apply(flagOtherGroup)
And top lines of prun:
Ordered by: internal time
List reduced from 531 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
834472 1.908 0.000 2.280 0.000 {isinstance}
497048/395400 1.192 0.000 1.572 0.000 {len}
32722 0.879 0.000 4.479 0.000 series.py:114(__init__)
34444 0.613 0.000 1.792 0.000 internals.py:3286(__init__)
25990 0.568 0.000 0.568 0.000 {method 'reduce' of 'numpy.ufunc' objects}
82266/78821 0.549 0.000 0.744 0.000 {numpy.core.multiarray.array}
42201 0.544 0.000 1.195 0.000 internals.py:62(__init__)
42201 0.485 0.000 1.812 0.000 internals.py:2015(make_block)
166244 0.476 0.000 0.615 0.000 {getattr}
4310 0.455 0.000 1.121 0.000 internals.py:2217(_rebuild_blknos_and_blklocs)
12054 0.417 0.000 2.134 0.000 internals.py:2355(apply)
9474 0.385 0.000 1.284 0.000 common.py:727(take_nd)
isinstance, len and getattr are just the built-in functions. There are a huge number of calls to the isinstance() function here; it is not that the call itself takes a lot of time, but the function was used 834472 times.
Presumably it is the pandas code that uses it.
I have a script that finds the sum of all numbers that can be written as the sum of fifth powers of their digits. (This problem is described in more detail on the Project Euler web site.)
I have written it two ways, but I do not understand the performance difference.
The first way uses nested list comprehensions:
exp = 5
def min_combo(n):
return ''.join(sorted(list(str(n))))
def fifth_power(n, exp):
return sum([int(x) ** exp for x in list(n)])
print sum( [fifth_power(j,exp) for j in set([min_combo(i) for i in range(101,1000000) ]) if int(j) > 10 and j == min_combo(fifth_power(j,exp)) ] )
and profiles like this:
$ python -m cProfile euler30.py
443839
3039223 function calls in 2.040 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1007801 1.086 0.000 1.721 0.000 euler30.py:10(min_combo)
7908 0.024 0.000 0.026 0.000 euler30.py:14(fifth_power)
1 0.279 0.279 2.040 2.040 euler30.py:6(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1007801 0.175 0.000 0.175 0.000 {method 'join' of 'str' objects}
1 0.013 0.013 0.013 0.013 {range}
1007801 0.461 0.000 0.461 0.000 {sorted}
7909 0.002 0.000 0.002 0.000 {sum}
The second way is the more usual for loop:
exp = 5
ans= 0
def min_combo(n):
return ''.join(sorted(list(str(n))))
def fifth_power(n, exp):
return sum([int(x) ** exp for x in list(n)])
for j in set([ ''.join(sorted(list(str(i)))) for i in range(100, 1000000) ]):
if int(j) > 10:
if j == min_combo(fifth_power(j,exp)):
ans += fifth_power(j,exp)
print 'answer', ans
Here is the profiling info again:
$ python -m cProfile euler30.py
answer 443839
2039325 function calls in 1.709 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
7908 0.024 0.000 0.026 0.000 euler30.py:13(fifth_power)
1 1.081 1.081 1.709 1.709 euler30.py:6(<module>)
7902 0.009 0.000 0.015 0.000 euler30.py:9(min_combo)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1007802 0.147 0.000 0.147 0.000 {method 'join' of 'str' objects}
1 0.013 0.013 0.013 0.013 {range}
1007802 0.433 0.000 0.433 0.000 {sorted}
7908 0.002 0.000 0.002 0.000 {sum}
Why does the list comprehension implementation call min_combo() 1,000,000 more times than the for loop implementation?
Because on the second one you implemented again the content of min_combo inside the set call...
Do the same thing and you'll have the same result.
BTW, change those to avoid big lists being created:
sum([something for foo in bar]) -> sum(something for foo in bar)
set([something for foo in bar]) -> set(something for foo in bar)
(without [...] they become generator expressions).
Based on that answer here are two versions of merge function used for mergesort.
Could you help me to understand why the second one is much faster.
I have tested it for list of 50000 and the second one is 8 times faster (Gist).
def merge1(left, right):
i = j = inv = 0
merged = []
while i < len(left) and j < len(right):
if left[i] <= right[j]:
merged.append(left[i])
i += 1
else:
merged.append(right[j])
j += 1
inv += len(left[i:])
merged += left[i:]
merged += right[j:]
return merged, inv
.
def merge2(array1, array2):
inv = 0
merged_array = []
while array1 or array2:
if not array1:
merged_array.append(array2.pop())
elif (not array2) or array1[-1] > array2[-1]:
merged_array.append(array1.pop())
inv += len(array2)
else:
merged_array.append(array2.pop())
merged_array.reverse()
return merged_array, inv
Here is the sort function:
def _merge_sort(list, merge):
len_list = len(list)
if len_list < 2:
return list, 0
middle = len_list / 2
left, left_inv = _merge_sort(list[:middle], merge)
right, right_inv = _merge_sort(list[middle:], merge)
l, merge_inv = merge(left, right)
inv = left_inv + right_inv + merge_inv
return l, inv
.
import numpy.random as nprnd
test_list = nprnd.randint(1000, size=50000).tolist()
test_list_tmp = list(test_list)
merge_sort(test_list_tmp, merge1)
test_list_tmp = list(test_list)
merge_sort(test_list_tmp, merge2)
Similar answer as kreativitea's above, but with more info (i think!)
So profiling the actual merge functions, for the merging of two 50K arrays,
merge 1
311748 function calls in 15.363 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 15.363 15.363 <string>:1(<module>)
1 15.322 15.322 15.362 15.362 merge.py:3(merge1)
221309 0.030 0.000 0.030 0.000 {len}
90436 0.010 0.000 0.010 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
merge2
250004 function calls in 0.104 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 0.104 0.104 <string>:1(<module>)
1 0.074 0.074 0.103 0.103 merge.py:20(merge2)
50000 0.005 0.000 0.005 0.000 {len}
100000 0.010 0.000 0.010 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100000 0.014 0.000 0.014 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'reverse' of 'list' objects}
So for merge1, it's 221309 len, 90436 append, and takes 15.363 seconds.
So for merge2, it's 50000 len, 100000 append, and 100000 pop and takes 0.104 seconds.
len and append pop are all O(1) (more info here), so these profiles aren't showing what's actually taking the time, since going of just that, it should be faster, but only ~20% so.
Okay the cause is actually fairly obvious if you just read the code:
In the first method, there is this line:
inv += len(left[i:])
so every time that is called, it has to rebuild an array. If you comment out this line (or just replace it with inv += 1 or something) then it becomes faster than the other method. This is the single line responsible for the increased time.
Having noticed this is the cause, the issue can be fixed by improving the code; change it to this for a speed up. After doing this, it will be faster than merge2
inv += len(left) - i
Update it to this:
def merge3(left, right):
i = j = inv = 0
merged = []
while i < len(left) and j < len(right):
if left[i] <= right[j]:
merged.append(left[i])
i += 1
else:
merged.append(right[j])
j += 1
inv += len(left) - i
merged += left[i:]
merged += right[j:]
return merged, inv
You can use the excellent cProfile module to help you solve things like this.
>>> import cProfile
>>> a = range(1,20000,2)
>>> b = range(0,20000,2)
>>> cProfile.run('merge1(a, b)')
70002 function calls in 0.195 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.184 0.184 0.195 0.195 <pyshell#7>:1(merge1)
1 0.000 0.000 0.195 0.195 <string>:1(<module>)
50000 0.008 0.000 0.008 0.000 {len}
19999 0.003 0.000 0.003 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
>>> cProfile.run('merge2(a, b)')
50004 function calls in 0.026 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.016 0.016 0.026 0.026 <pyshell#12>:1(merge2)
1 0.000 0.000 0.026 0.026 <string>:1(<module>)
10000 0.002 0.000 0.002 0.000 {len}
20000 0.003 0.000 0.003 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
20000 0.005 0.000 0.005 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'reverse' of 'list' objects}
After looking at the information a bit, it looks like the commenters are correct-- its not the len function-- it's the string module. The string module is invoked when you compare the length of things, as follows:
>>> cProfile.run('0 < len(c)')
3 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
It is also invoked when slicing a list, but this is a very quick operation.
>>> len(c)
20000000
>>> cProfile.run('c[3:2000000]')
2 function calls in 0.011 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.011 0.011 0.011 0.011 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
TL;DR: Something in the string module is taking 0.195s in your first function, and 0.026s in your second function. : apparently, the rebuilding of the array in inv += len(left[i:]) this line.
If I had to guess, I would say it probably has to do with the cost of removing elements from a list, removing from the end (pop) is quicker than removing from the beginning. the second favors removing elements from the end of the list.
See Performance Notes: http://effbot.org/zone/python-list.htm
"The time needed to remove an item is about the same as the time needed to insert an item at the same location; removing items at the end is fast, removing items at the beginning is slow."