Is python smart enough to replace function calls with constant result?

Coming from the beautiful world of C, I am trying to understand this behavior:
In [1]: dataset = sqlContext.read.parquet('indir')
In [2]: sizes = dataset.mapPartitions(lambda x: [len(list(x))]).collect()
In [3]: for item in sizes:
   ...:     if(item == min(sizes)):
   ...:         count = count + 1
   ...:
would not even finish after 20 minutes, and I know that the list sizes is not that big, less than 205k in length. However this executed instantly:
In [8]: min_item = min(sizes)
In [9]: for item in sizes:
   ...:     if(item == min_item):
   ...:         count = count + 1
   ...:
So what happened?
My guess: Python could not understand that min(sizes) will always be constant, and thus replace it after the first few calls with its return value.. since Python uses an interpreter..
The reference for min() doesn't say anything that would explain the matter to me. I was thinking that it might need to look at the partitions to do that, but that shouldn't be the case, since sizes is a list, not an RDD!
Edit:
Here is the source of my confusion, I wrote a similar program in C:
for(i = 0; i < SIZE; ++i)
    if(i == mymin(array, SIZE))
        ++count;
and got these timings:
C02QT2UBFVH6-lm:~ gsamaras$ gcc -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 98.679177000 seconds wall clock time.
C02QT2UBFVH6-lm:~ gsamaras$ gcc -O3 -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 0.000000000 seconds wall clock time.
and for timings, I used Nominal Animal's approach from my Time measurements question.

I'm by no means an expert on the inner workings of python, but from my understanding thus far you'd like to compare the speed of
for item in sizes:
    if(item == min(sizes)):
        count = count + 1
and
min_item = min(sizes)
for item in sizes:
    if(item == min_item):
        count = count + 1
Now someone correct me if I have any of this wrong but,
In Python, lists are mutable and do not have a fixed length (and are treated as such), while in C an array has a fixed size. From this question:
Python lists are very flexible and can hold completely heterogeneous, arbitrary data, and they can be appended to very efficiently, in amortized constant time. If you need to shrink and grow your array time-efficiently and without hassle, they are the way to go. But they use a lot more space than C arrays.
Now take this example
for item in sizes:
    if(item == min(sizes)):
        new_item = item - 1
        sizes.append(new_item)
Then the value of item == min(sizes) would be different on the next iteration. Python doesn't cache the resulting value of min(sizes) since it would break the above example, or require some logic to check if the list has been changed. Instead it leaves that up to you. By defining min_item = min(sizes) you are essentially caching the result yourself.
Now, since the array has a fixed size in C, it can find the min value with less overhead than a Python list, which is why I think it is not a problem in C (as well as C being a much lower-level language).
Again, I don't fully understand the underlying code and compilation for python, and I'm certain if you analyzed the process of the loops in python, you'd see python repeatedly computing min(sizes), causing the extreme amount of lag. I'd love to learn more about the inner workings of python (for example, are any methods cached in a loop for python, or is everything computed again for each iteration?) so if anyone has more info and/or corrections, let me know!
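To see the difference concretely, here is a rough, self-contained sketch (using a made-up list rather than the asker's Spark data, and a smaller size so the slow version finishes) that times both loops with timeit:

import random
import timeit

sizes = [random.randint(0, 1000) for _ in range(10000)]  # stand-in for the real list

def min_inside_loop():
    count = 0
    for item in sizes:
        if item == min(sizes):   # min() re-scans the whole list on every iteration
            count = count + 1
    return count

def min_cached():
    count = 0
    min_item = min(sizes)        # computed once, before the loop
    for item in sizes:
        if item == min_item:
            count = count + 1
    return count

# number=1 is enough to see the quadratic vs. linear behaviour
print(timeit.timeit(min_inside_loop, number=1))
print(timeit.timeit(min_cached, number=1))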

Related

Efficient Way to Repeatedly Split Large NumPy Array and Record Middle

I have a large NumPy array nodes = np.arange(100_000_000) and I need to rearrange this array by:
Record and then remove the middle value in the array
Split the array into the left half and right half
Repeat Steps 1-2 for each half
Stop when all values are exhausted
So, for a smaller input example nodes = np.arange(10), the output would be:
[5 2 8 1 4 7 9 0 3 6]
This was accomplished by naively doing:
import numpy as np
def split(node, out):
    mid = len(node) // 2
    out.append(node[mid])
    return node[:mid], node[mid+1:]

def reorder(a):
    nodes = [a.tolist()]
    out = []
    while nodes:
        tmp = []
        for node in nodes:
            for n in split(node, out):
                if n:
                    tmp.append(n)
        nodes = tmp
    return np.array(out)

if __name__ == "__main__":
    nodes = np.arange(10)
    print(reorder(nodes))
However, this is way too slow for nodes = np.arange(100_000_000) and so I am looking for a much faster solution.
You can vectorize your function with Numpy by working on groups of slices.
Here is an implementation:
# Similar to [e for tmp in zip(a, b) for e in tmp],
# but on Numpy arrays and much faster
def interleave(a, b):
    assert len(a) == len(b)
    return np.column_stack((a, b)).reshape(len(a) * 2)

# n is the length of the input range (len(a) in your example)
def fast_reorder(n):
    if n == 0:
        return np.empty(0, dtype=np.int32)
    startSlices = np.array([0], dtype=np.int32)
    endSlices = np.array([n], dtype=np.int32)
    allMidSlices = np.empty(n, dtype=np.int32)  # Similar to "out" in your implementation
    midInsertCount = 0  # Actual size of allMidSlices
    # Generate a bunch of middle values as long as there are valid slices to split
    while midInsertCount < n:
        # Generate the new mid/left/right slices
        midSlices = (endSlices + startSlices) // 2
        # Computing the next slices is not needed for the last step
        if midInsertCount + len(midSlices) < n:
            # Generate the next slices (possibly with invalid ones)
            newStartSlices = interleave(startSlices, midSlices+1)
            newEndSlices = interleave(midSlices, endSlices)
            # Discard invalid slices
            isValidSlices = newStartSlices < newEndSlices
            startSlices = newStartSlices[isValidSlices]
            endSlices = newEndSlices[isValidSlices]
        # Fast appending
        allMidSlices[midInsertCount:midInsertCount+len(midSlices)] = midSlices
        midInsertCount += len(midSlices)
    return allMidSlices[0:midInsertCount]
On my machine, this is 89 times faster than your scalar implementation with the input np.arange(100_000_000), dropping from 2min35 to 1.75s. It also consumes far less memory (roughly 3 to 4 times less). Note that if you want faster code, then you probably need to use a native language like C or C++.
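As a quick sanity check (my addition, assuming both the question's reorder and the fast_reorder above are defined in the same session), the two implementations can be compared on the small example from the question:

# Both should produce [5 2 8 1 4 7 9 0 3 6] for an input of length 10
small = np.arange(10)
assert np.array_equal(reorder(small), fast_reorder(len(small)))
print(fast_reorder(10))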
Edit:
The question has been updated to have a much smaller input array, so I leave the below for historical reasons. Basically it was likely a typo, but we often get accustomed to computers working with insanely large numbers, and when memory is involved they can be a real problem.
There is already a numpy based solution submitted by someone else that I think fits the bill.
Your code requires an insane amount of RAM just to hold 100 billion 64-bit integers. Do you have 800GB of RAM? Then you convert the numpy array to a list, which will be substantially larger than the array (each packed 64-bit int in the numpy array becomes a much less memory-efficient Python int object, and the list holds a pointer to each of those objects). Then you make a lot of slices of the list, which will not duplicate the data but will duplicate the pointers to the data and use even more RAM. You also append all the result values to a list a single value at a time. Lists are generally very fast for adding items, but at such an extreme size this will not only be slow, but the way the list is allocated is likely to be extremely wasteful RAM-wise and contribute to major problems (I believe they double in size when they get to a certain level of fullness, so you end up allocating more RAM than you need and doing many allocations and likely copies).
What kind of machine are you running this on? There are ways to improve your code, but unless you're running it on a supercomputer I don't know that you're going to ever finish that calculation. I only.. only? have 32GB of RAM, and I'm not going to even try to create a 100B int64 numpy array, as I don't want to use up SSD write life on a mass of virtual memory.
As for improving your code: stick to numpy arrays, don't change to a Python list, as it will greatly increase the RAM you need. Preallocate a numpy array to put the answer in. Then you need a new algorithm. Anything recursive or recursive-like (i.e. a loop splitting the input) will require tracking a lot of state; your nodes list is going to be extraordinarily gigantic and again use a lot of RAM. You could use len(a) to indicate values that have been removed from your list and scan through the entire array each time to figure out what to do next, but that saves RAM in favour of a tremendous amount of searching through a gigantic array. I feel like there is an algorithm to cut numbers from each end and place them in the output, just tracking the beginning and end, but I haven't figured it out, at least not yet.
I also think there is a simpler algorithm where you just track the number of splits you've done instead of making a giant list of slices and keeping it all in memory. Take the middle of the left half and then the middle of the right, then count up one; when you take the middle of the left half's left half, you know you have to jump to the right half, and since the count is one you jump over to the original right half's left half, and so on... Based on the depth into the halves and the length of the input, you should be able to jump around without scanning or tracking all of those slices, though I haven't been able to dedicate much time to thinking this through in my head.
With a problem of this nature, if you really need to push the limits, you should consider using C/C++ so you can be as efficient as possible with RAM usage, and because you're doing an insane number of tiny operations, which doesn't map well to Python performance.

Is it computationally more intensive to assign value to a variable or repeat a simple operation?

Now I may be nitpicking here, but I wanted to know which is computationally more efficient over a large number of iterations (assuming no restrictions on memory), purely in terms of the time taken to run the overall programme.
I am asking mainly in terms of two languages, Python and C, since these are the two I use most.
For example, in C, something like:
int count = 0, sum = 0;
while (count < 99999) {
    if (((long)pow(count, 2) / 10) % 10 == 4) { // just some random operation
        sum += pow(count, 2);
    }
    count++;
}
or
int count = 0, sum = 0, temp = 0;
while (count < 99999) {
    temp = pow(count, 2);
    if ((temp / 10) % 10 == 4) { // just some random operation
        sum += temp;
    }
    count++;
}
In Python, something like:
for i in range(99999):
    n = len(str(i))
    print "length of string", str(i), "is", n
or
for i in range(99999):
    temp = str(i)
    n = len(temp)
    print "length of string", temp, "is", n
Now these are just random operations I thought of on the fly. My main question is whether it is better to create a new variable or just repeat the operation. I know the computing power required will change as i and count increase, but generalizing over a large number of iterations, which is better?
I have tested both of your code samples in Python with timeit, and the result was 1.0221271729969885 seconds for the first option and 1.0154028950055363 seconds for the second option.
I have done this test with only one iteration of each example, and as you can see they are extremely close to each other, too close to be reliable after only one iteration. If you want to learn more, though, I suggest you do these tests yourself with timeit.
However, note that the change here replaces a variable assignment to temp with two calls to str(i), not one, and the outcome can of course vary wildly for other functions that are more or less complicated and time-consuming than str.
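If you want to run such a measurement yourself, a minimal sketch could look like the following (the print is replaced by building a string, so the timing reflects the str(i) calls rather than console output; the exact numbers will of course differ on your machine):

import timeit

repeat_call = '''
for i in range(99999):
    n = len(str(i))
    s = "length of string " + str(i) + " is " + str(n)   # str(i) computed a second time
'''

cache_call = '''
for i in range(99999):
    temp = str(i)
    n = len(temp)
    s = "length of string " + temp + " is " + str(n)     # reuses the cached temp
'''

print(timeit.timeit(repeat_call, number=10))
print(timeit.timeit(cache_call, number=10))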
Caching does improve performance and readability of the code most of the time. There may be corner cases for some super-fast operations, but I'd say the type of code used in the second examples would always be better.
The cached value will most probably be stored in a register, so only a single mov is involved. According to this document, there are not many cheaper operations.
As for readability, consider bigger function calls with more arguments: while reviewing code, it takes more brainwork to check that the calls are really the same, and if the function changes it may introduce bugs where one call has been modified and the other one has been forgotten (true mostly for Python).
All in all, it should be readability driving your choices in such cases rather than performance gains, unless a profiler points at this piece of code.

Speeding up dynamic programming in python/numpy

I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
            except:
                M[i,j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts to cutting the time down. However, as I need to evaluate potentially the entire matrix I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10
#define min(x, y) ((x) < (y) ? (x) : (y))

void optimalcost(double** costin, double** costout){
    /*
      We assume that costout is initially
      filled with costin's values.
    */
    int i, j;
    double a, b, c, prevcost = 0.0;
    for(i = 1; i < 400; i++){
        for(j = 1; j < 400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += min(a, min(b, c));
            prevcost = costout[i][j];
        }
    }
    return;
}
Update:
I'm on Mac, and I don't want to install a whole new Python toolchain so I used Homebrew.
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height, width = left.shape
    cost = np.zeros((height, width, width))
    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row,x,y] = abs(left[row,x]-right[row,y])
    return cost

@autojit
def optimalcosts(initcost):
    height = initcost.shape[0]
    costs = np.zeros_like(initcost)
    for row in range(height):
        costs[row,:,:] = optimalcost(initcost[row])
    return costs

@autojit
def optimalcost(cost):
    width1, width2 = cost.shape
    P1 = 10
    prevcost = 0.0
    M = np.array(cost)
    for i in range(1, width1):
        for j in range(1, width2):
            M[i,j] += min(M[i-1,j-1], prevcost+P1, M[i,j-1]+P1)
            prevcost = M[i,j]
    return M
prob_size = 400
left = np.random.rand(prob_size,prob_size)
right = np.random.rand(prob_size,prob_size)
print '---------- Numba Time ----------'
t = time.time()
c = cost(left,right)
optimalcost(c[100])
print time.time()-t
print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left,right)
optimalcost.py_func(c[100])
print time.time()-t
It's interesting writing code in Python that is so un-Pythonic. Note for anyone interested in writing Numba code, you need to explicitly express loops in your code. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest path (or lowest-energy path) above, you can vectorize the problem by thinking about what can be computed at the same time (and also try to avoid making copies):
Suppose we are finding the shortest path in the "row" direction (i.e. horizontally). We can first create our algorithm input:
# The problem, 300 400*400 matrices
# Create an infinitely high boundary so that we don't need to handle indexing "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and timeit:
%%timeit
for i in np.arange(a.shape[1]-1):
    A[i].min(2, T)
    B[i] += T
The timing result on my (super old laptop) machine is 1.78s, which is already way faster than 3 minutes. I believe you can improve it even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis).
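For the multiprocessing route, a minimal sketch could look like the following (my own illustration, reusing a slightly cleaned-up version of the question's scalar optimalcost as the per-matrix worker; each of the 300 matrices is independent, so they split naturally across worker processes):

import numpy as np
from multiprocessing import Pool

def optimalcost(cost, P1=10):
    # Same recurrence as in the question, starting at index 1 so the
    # bare try/except is no longer needed.
    M = np.array(cost)
    width1, width2 = M.shape
    for i in range(1, width1):
        for j in range(1, width2):
            M[i, j] += min(M[i-1, j-1], M[i-1, j] + P1, M[i, j-1] + P1)
    return M

if __name__ == '__main__':
    batch = np.random.rand(300, 400, 400).astype('f')
    pool = Pool()                            # one worker per CPU core by default
    results = pool.map(optimalcost, batch)   # iterates over the batch (first) axis
    pool.close()
    pool.join()
    print(results[0].shape)

Note that this only parallelizes across matrices; the per-matrix Python loops remain as slow as before, so the Numba or stride-tricks approaches above still matter.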

Faster way to access 2D lists/arrays and do calculations in Python?

I am running 2 nested loops (first one 120 runs, second one 500 runs).
In most of the 120x500 runs I need to access several lists and lists-in-lists (I will call them 2D arrays).
At the moment the 120x500 runs take about 4 seconds. Most of the time is taken by the three list appends and several 2D array accesses.
The arrays are prefilled by me outside of the loops.
Here is my code:
#Range from 0 to 119
for cur_angle in range(0, __ar_angular_width-1):
    #Range from 0 to 499
    for cur_length in range(0, int(__ar_length * range_res_scale)-1):
        v_x = (auv_rot_mat_0_0*self.adjacent_dx[cur_angle][cur_length])+(auv_rot_mat_0_1*self.opposite_dy[cur_angle][cur_length])
        v_y = (auv_rot_mat_1_0*self.adjacent_dx[cur_angle][cur_length])+(auv_rot_mat_1_1*self.opposite_dy[cur_angle][cur_length])
        v_x_diff = (v_x+auv_trans_x) - ocp_grid_origin_x
        v_y_diff = (v_y+auv_trans_y) - ocp_grid_origin_y
        p_x = (m.floor(v_x_diff/ocp_grid_resolution))
        p_y = (m.floor(v_y_diff/ocp_grid_resolution))
        data_index = int(p_y * ocp_grid_width + p_x)
        if data_index >= 0 and data_index < (len(ocp_grid.data)-1):
            probability = ocp_grid.data[data_index]
            if probability == 100:
                if not m.isnan(self.v_directions[cur_angle]):
                    magnitude = m.pow(probability, 2) * self.magnitude_ab[cur_length]
                    ov_1 = self.v_directions[cur_angle]
                    ov_2 = magnitude
                    ov_3 = self.distances[cur_length]
                    obstacle_vectors.append(ov_1)
                    obstacle_vectors.append(ov_2)
                    obstacle_vectors.append(ov_3)
I tried to figure out the processing times via time.time() and taking differences, but it didn't work reliably; the calculated times fluctuated quite a lot.
I am not really a Python pro, so any advice is welcome.
Any ideas how to make the code faster?
EDIT:
The initialization of the arrays was done with this code:
self.adjacent_dx = [i[:] for i in [[0]*(length_iterations-1)]*(angular_iterations-1)]
self.opposite_dy = [i[:] for i in [[0]*(length_iterations-1)]*(angular_iterations-1)]
Some general tips:
1. Try to optimize your algorithm the best you can (a small illustration follows below).
2. If possible, use PyPy instead of regular Python. This usually doesn't require any code modification if all your external dependencies work with it.
3. Static typing and compilation to C can add an additional boost, but requires some simple code modifications. For this purpose you can use Cython.
Note that step 1 is often very hard to do and the most time-consuming, giving you a small boost if your code is already good, while steps 2 and 3 can give you a dramatic boost without much additional effort.
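As a small, hypothetical illustration of point 1 (the names and sizes only mimic the question's loops, they are not the real data): lookups that depend only on the outer index can be hoisted out of the inner loop, so they happen 120 times instead of 120x500 times:

# Stand-in 2D lists shaped like the ones in the question
adjacent_dx = [[float(a * l) for l in range(500)] for a in range(120)]
opposite_dy = [[float(a + l) for l in range(500)] for a in range(120)]

total = 0.0
for cur_angle in range(120):
    dx_row = adjacent_dx[cur_angle]   # hoisted: does not depend on cur_length
    dy_row = opposite_dy[cur_angle]
    for cur_length in range(500):
        total += dx_row[cur_length] + dy_row[cur_length]
print(total)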

Why is my MergeSort so slow in Python?

I'm having some troubles understanding this behaviour.
I'm measuring the execution time with the timeit-module and get the following results for 10000 cycles:
Merge : 1.22722930395
Bubble: 0.810706578175
Select: 0.469924766812
This is my code for MergeSort:
def mergeSort(array):
    if len(array) <= 1:
        return array
    else:
        left = array[:len(array)/2]
        right = array[len(array)/2:]
        return merge(mergeSort(left), mergeSort(right))

def merge(array1, array2):
    merged_array = []
    while len(array1) > 0 or len(array2) > 0:
        if array2 and not array1:
            merged_array.append(array2.pop(0))
        elif (array1 and not array2) or array1[0] < array2[0]:
            merged_array.append(array1.pop(0))
        else:
            merged_array.append(array2.pop(0))
    return merged_array
Edit:
I've changed the list operations to use pointers and my tests now work with a list of 1000 random numbers from 0-1000. (btw: I changed to only 10 cycles here)
result:
Merge : 0.0574434420723
Bubble: 1.74780097558
Select: 0.362952293025
This is my rewritten merge definition:
def merge(array1, array2):
    merged_array = []
    pointer1, pointer2 = 0, 0
    while pointer1 < len(array1) and pointer2 < len(array2):
        if array1[pointer1] < array2[pointer2]:
            merged_array.append(array1[pointer1])
            pointer1 += 1
        else:
            merged_array.append(array2[pointer2])
            pointer2 += 1
    while pointer1 < len(array1):
        merged_array.append(array1[pointer1])
        pointer1 += 1
    while pointer2 < len(array2):
        merged_array.append(array2[pointer2])
        pointer2 += 1
    return merged_array
seems to work pretty well now :)
list.pop(0) pops the first element and has to shift all remaining ones, this is an additional O(n) operation which must not happen.
Also, slicing a list object creates a copy:
left = array[:len(array)/2]
right = array[len(array)/2:]
Which means you're also using O(n * log(n)) memory instead of O(n).
I can't see BubbleSort, but I bet it works in-place, no wonder it's faster.
You need to rewrite it to work in-place. Instead of copying part of original list, pass starting and ending indexes.
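One possible shape of that rewrite (a sketch, not the only way to do it): sort the list in place between two indexes, so no slices of the input are created and nothing is popped from the front; only a small temporary buffer is used per merge:

def merge_sort_inplace(array, lo=0, hi=None):
    # Sorts array[lo:hi] in place.
    if hi is None:
        hi = len(array)
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    merge_sort_inplace(array, lo, mid)
    merge_sort_inplace(array, mid, hi)
    # Merge the two sorted halves back into array[lo:hi]
    merged = []
    i, j = lo, mid
    while i < mid and j < hi:
        if array[i] <= array[j]:
            merged.append(array[i])
            i += 1
        else:
            merged.append(array[j])
            j += 1
    merged.extend(array[i:mid])
    merged.extend(array[j:hi])
    array[lo:hi] = merged

data = [5, 1, 4, 2, 3]
merge_sort_inplace(data)
print(data)   # [1, 2, 3, 4, 5]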
For starters: I cannot reproduce your timing results, on 100 cycles and lists of size 10000. The exhaustive benchmark with timeit of all implementations discussed in this answer (including bubblesort and your original snippet) is posted as a gist here. I find the following results for the average duration of a single run:
Python's native (Tim)sort : 0.0144600081444
Bubblesort : 26.9620819092
(Your) Original Mergesort : 0.224888720512
Now, to make your function faster, you can do a few things.
Edit: Well, apparently, I was wrong on that one (thanks cwillu). Length computation takes O(1) in Python. But removing useless computation everywhere still improves things a bit (Original Mergesort: 0.224888720512, no-length Mergesort: 0.195795390606):
def nolenmerge(array1, array2):
    merged_array = []
    while array1 or array2:
        if not array1:
            merged_array.append(array2.pop(0))
        elif (not array2) or array1[0] < array2[0]:
            merged_array.append(array1.pop(0))
        else:
            merged_array.append(array2.pop(0))
    return merged_array

def nolenmergeSort(array):
    n = len(array)
    if n <= 1:
        return array
    left = array[:n/2]
    right = array[n/2:]
    return nolenmerge(nolenmergeSort(left), nolenmergeSort(right))
Second, as suggested in this answer, pop(0) is linear. Rewrite your merge to pop() at the end:
def fastmerge(array1, array2):
    merged_array = []
    while array1 or array2:
        if not array1:
            merged_array.append(array2.pop())
        elif (not array2) or array1[-1] > array2[-1]:
            merged_array.append(array1.pop())
        else:
            merged_array.append(array2.pop())
    merged_array.reverse()
    return merged_array
This is again faster: no-len Mergesort: 0.195795390606, no-len Mergesort+fastmerge: 0.126505711079
Third - and this would only be useful as-is if you were using a language that does tail call optimization; without it, it's a bad idea - your recursive call to mergeSort is not tail-recursive; it calls both (mergeSort left) and (mergeSort right) recursively while there is remaining work in the call (the merge).
But you can make the merge sort tail-recursive by using CPS (this will run out of stack for even modest lists if you don't do TCO):
def cps_merge_sort(array):
    return cpsmergeSort(array, lambda x: x)

def cpsmergeSort(array, continuation):
    n = len(array)
    if n <= 1:
        return continuation(array)
    left = array[:n/2]
    right = array[n/2:]
    return cpsmergeSort(left, lambda leftR:
        cpsmergeSort(right, lambda rightR:
            continuation(fastmerge(leftR, rightR))))
Once this is done, you can do TCO by hand to defer the call stack management done by recursion to the while loop of a normal function (trampolining, explained e.g. here, trick originally due to Guy Steele). Trampolining and CPS work great together.
You write a thunking function, that "records" and delays application: it takes a function and its arguments, and returns a function that returns (that original function applied to those arguments).
thunk = lambda name, *args: lambda: name(*args)
You then write a trampoline that manages calls to thunks: it applies a thunk until the thunk returns a result (as opposed to another thunk)
def trampoline(bouncer):
    while callable(bouncer):
        bouncer = bouncer()
    return bouncer
Then all that's left is to "freeze" (thunk) all your recursive calls from the original CPS function, to let the trampoline unwrap them in proper sequence. Your function now returns a thunk, without recursion (and discarding its own frame), at every call:
def tco_cpsmergeSort(array, continuation):
    n = len(array)
    if n <= 1:
        return continuation(array)
    left = array[:n/2]
    right = array[n/2:]
    return thunk(tco_cpsmergeSort, left, lambda leftR:
        thunk(tco_cpsmergeSort, right, lambda rightR:
            continuation(fastmerge(leftR, rightR))))

mycpomergesort = lambda l: trampoline(tco_cpsmergeSort(l, lambda x: x))
Sadly this does not go that fast (recursive mergesort: 0.126505711079, this trampolined version: 0.170638551712). OK, I guess the stack blowup of the recursive merge sort algorithm is in fact modest: as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (and removing frames). So for 10K-sized lists, you get a function stack of at most log_2(10 000) = 14 ... pretty modest.
You can do a slightly more involved stack-based TCO elimination in the guise of what this SO answer gives:
def leftcomb(l):
    maxn, leftcomb = len(l), []
    n = maxn/2
    while maxn > 1:
        leftcomb.append((l[n:maxn], False))
        maxn, n = n, n/2
    return l[:maxn], leftcomb

def tcomergesort(l):
    l, stack = leftcomb(l)
    while stack:  # l sorted, stack contains tagged slices
        i, ordered = stack.pop()
        if ordered:
            l = fastmerge(l, i)
        else:
            stack.append((l, True))  # store return call
            rsub, ssub = leftcomb(i)
            stack.extend(ssub)  # recurse
            l = rsub
    return l
But this goes only a tad faster (trampolined mergesort: 0.170638551712, this stack-based version: 0.144994809628). Apparently, the stack-building that Python does at the recursive calls of our original merge sort is pretty inexpensive.
The final results? On my machine (Ubuntu Natty's stock Python 2.7.1+), the average run timings (out of 100 runs, except for Bubblesort, on a list of size 10000 containing random integers between 0 and 10000000) are:
Python's native (Tim)sort : 0.0144600081444
Bubblesort : 26.9620819092
Original Mergesort : 0.224888720512
no-len Mergesort : 0.195795390606
no-len Mergesort + fastmerge : 0.126505711079
trampolined CPS Mergesort + fastmerge : 0.170638551712
stack-based mergesort + fastmerge: 0.144994809628
Your merge-sort has a big constant factor, you have to run it on large lists to see the asymptotic complexity benefit.
Umm.. 1,000 records?? You are still well within polynomial coefficient dominance here.. If I have
selection-sort: 15 * n ^ 2 (reads) + 5 * n^2 (swaps)
insertion-sort: 5 * n ^2 (reads) + 15 * n^2 (swaps)
merge-sort: 200 * n * log(n) (reads) 1000 * n * log(n) (merges)
You're going to be in a close race for a long while.. By the way, 2x faster in sorting is NOTHING. Try 100x slower. That's where the real differences are felt. Try "won't finish in my life-time" algorithms (there are known regular expressions that take this long to match simple strings).
So try 1M or 1G records and let us know if you still think merge-sort isn't doing too well.
That being said..
There are lots of things causing this merge-sort to be expensive. First of all, nobody ever runs quick or merge sort on small scale data-structures.. Where you have if (len <= 1), people generally put:
if (len <= 16) : (use inline insertion-sort)
else: merge-sort
At EACH propagation level.
Since insertion-sort has a smaller coefficient cost at smaller sizes of n. Note that 50% of your work is done in this last mile.
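A sketch of that hybrid (the cutoff of 16 is the one suggested above; the insertion sort helper and the reuse of the question's merge() are my own choices):

def insertion_sort(array):
    for i in range(1, len(array)):
        key = array[i]
        j = i - 1
        while j >= 0 and array[j] > key:
            array[j + 1] = array[j]
            j -= 1
        array[j + 1] = key
    return array

def hybrid_mergeSort(array):
    if len(array) <= 16:            # small runs: insertion sort has the smaller constant
        return insertion_sort(array)
    mid = len(array) // 2
    # merge() is the one defined earlier in the question (or fastmerge above)
    return merge(hybrid_mergeSort(array[:mid]), hybrid_mergeSort(array[mid:]))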
Next, you are needlessly running array1.pop(0) instead of maintaining index counters. If you're lucky, Python is efficiently managing start-of-array offsets, but all else being equal, you're mutating the input parameters.
Also, you know the size of the target array during the merge, so why copy-and-double the merged_array repeatedly? Pre-allocate the target array at the start of the function.. That'll save at least a dozen 'clones' per merge level.
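A sketch combining those two points (index counters instead of pop(0), plus a pre-allocated output list instead of repeated appends):

def merge_prealloc(array1, array2):
    merged = [None] * (len(array1) + len(array2))   # target allocated once, final size known
    i = j = k = 0
    while i < len(array1) and j < len(array2):
        if array1[i] <= array2[j]:
            merged[k] = array1[i]
            i += 1
        else:
            merged[k] = array2[j]
            j += 1
        k += 1
    while i < len(array1):
        merged[k] = array1[i]
        i += 1
        k += 1
    while j < len(array2):
        merged[k] = array2[j]
        j += 1
        k += 1
    return merged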
In general, merge-sort uses 2x the size of RAM.. Your algorithm is probably using 20x because of all the temporary merge buffers (hopefully Python can free structures before recursion). It breaks elegance, but generally the best merge-sort algorithms make an immediate allocation of a merge buffer equal to the size of the source array, and you perform complex address arithmetic (or array-index + span-length) to just keep merging data-structures back and forth. It won't be as elegant as a simple recursive problem like this, but it's somewhat close.
In C sorting, cache coherence is your biggest enemy. You want hot data-structures so you maximize your cache. By allocating transient temp buffers (even if the memory manager is returning pointers to hot memory) you run the risk of making slow DRAM calls (pre-filling cache lines for data you're about to overwrite). This is one advantage insertion-sort, selection-sort and quick-sort have over merge-sort (when implemented as above).
Speaking of which, something like quick-sort is both naturally elegant code and naturally efficient code, and doesn't waste any memory (google it on Wikipedia - they have a javascript implementation from which to base your code). Squeezing the last ounce of performance out of quick-sort is hard (especially in scripting languages, which is why they generally just use the C API to do that part), and you have a worst case of O(n^2). You can try to be clever by doing a combination bubble-sort/quick-sort to mitigate the worst case.
Happy coding.
