Python: Very slow execution loops

I am writing code for proposing typo corrections using an HMM and the Viterbi algorithm. At some point, for each word in the text, I have to do the following (let's assume I have 10,000 words):
# FYI: Windows 10, 64-bit, Intel i7, 4 GB RAM, Python 2.7.3
import numpy as np
import pandas as pd

for k in range(10000):
    tempWord = corruptList20[k]  # temp word read from the list which has all of the words
    delta = np.zeros((26, len(tempWord)))
    sai = np.chararray((26, len(tempWord)))
    sai[:] = '#'

    # INITIALIZATION OF DELTA
    for i in range(26):
        delta[i][0] = # CALCULATION: matrix read and multiplication; each cell is different
    # INITIALIZATION END

    # 6. DELTA CALCULATION
    for deltaIndex in range(1, len(tempWord)):
        for j in range(26):
            tempDelta = 0.0
            maxDelta = 0.0
            maxState = ''
            for i in range(26):
                # CALCULATION to fill each cell involves:
                # 1. matrix read and multiplication
                # 2. finding the column max
                # logical operations and if-then-else operations

    # 7. SAI BACKWARD TRACKING
    delta2 = pd.DataFrame(delta)
    sai2 = pd.DataFrame(sai)
    proposedWord = np.zeros(len(tempWord), str)
    editId = 0
    for col in delta2.columns:
        # CALCULATION to fill each cell involves:
        # 1. matrix read and multiplication
        # 2. finding the column max
        # logical operations and if-then-else operations

    editList20.append(''.join(editWord))
# END OF LOOP
As you can see it is computationally involved, and when I run it, it takes too much time.
My laptop was recently stolen, so I am currently running this on Windows 10, 64-bit, 4 GB RAM, Python 2.7.3.
My question: can anybody see anything I could optimize? Do I have to delete the matrices I created in the loop before the loop goes to the next round to free memory, or is this done automatically?
After the comments below and using xrange instead of range, the performance increased by almost 30%. I am adding the screenshot here after this change.

I don't think the range discussion makes much difference. With Python 3, where range is a lazy sequence object, expanding it into a list before iterating doesn't change the time much.
In [107]: timeit for k in range(10000):x=k+1
1000 loops, best of 3: 1.43 ms per loop
In [108]: timeit for k in list(range(10000)):x=k+1
1000 loops, best of 3: 1.58 ms per loop
With numpy and pandas the real key to speeding up loops is to replace them with compiled operations that work on the whole array or dataframe. But even in pure Python, focus on streamlining the contents of the iteration, not the iteration mechanism.
======================
for i in range(26):
delta[i][0] = #CALCULATION matrix read and multiplication
A minor change: delta[i, 0] = ...; this is the array way of addressing a single element. Functionally it often is the same, but the intent is clearer. But think: can't you set all of that column at once?
delta[:,0] = ...
====================
N = len(tempWord)
delta = np.zeros((26, N))
etc
In tight loops temporary variables like this can save time. This one isn't tight, so here it just adds clarity.
===========================
This is one ugly nested triple loop; admittedly 26 steps isn't large, but 26*26*N is:
for deltaIndex in range(1, N):
    for j in range(26):
        tempDelta = 0.0
        maxDelta = 0.0
        maxState = ''
        for i in range(26):
            # CALCULATION
            # 1. matrix read and multiplication
            # 2. finding the column max
            # logical operations and if-then-else operations
Focus on replacing this with array operations. It's those 3 commented lines that need to be changed, not the iteration mechanism.
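For concreteness, here is a sketch (my addition, not from the original answer) of what that vectorized step might look like. It assumes hypothetical transition (26x26, previous state x current state) and emission (26x26, state x observed letter) probability arrays, and uses an integer backpointer array instead of the '#'-filled chararray; only the loop over character positions remains:
back = np.zeros((26, N), dtype=int)
for t in range(1, N):
    obs = ord(tempWord[t]) - ord('a')                  # index of the observed letter
    # scores[i, j] = delta[i, t-1] * transition[i, j] * emission[j, obs]
    scores = delta[:, t - 1, None] * transition * emission[:, obs]
    delta[:, t] = scores.max(axis=0)                   # best score ending in state j
    back[:, t] = scores.argmax(axis=0)                 # best predecessor i for state j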
================
Making proposedWord a list rather than an array might be faster. Small list operations are often faster than array ones, since numpy arrays have creation overhead.
In [136]: timeit np.zeros(20,str)
100000 loops, best of 3: 2.36 µs per loop
In [137]: timeit x=[' ']*20
1000000 loops, best of 3: 614 ns per loop
You have to be careful when creating 'empty' lists that the elements are truly independent, not just copies of the same thing.
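A quick illustration of that pitfall (my addition): with immutable elements like the ' ' strings above it is harmless, but with mutable elements the repeated-reference version bites:
rows = [[0] * 3] * 4                   # four references to the *same* inner list
rows[0][0] = 99
print(rows)                            # [[99, 0, 0], [99, 0, 0], [99, 0, 0], [99, 0, 0]]
rows = [[0] * 3 for _ in range(4)]     # four independent inner lists
rows[0][0] = 99
print(rows)                            # [[99, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]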
In [159]: %%timeit
x = np.zeros(20,str)
for i in range(20):
    x[i] = chr(65+i)
.....:
100000 loops, best of 3: 14.1 µs per loop
In [160]: timeit [chr(65+i) for i in range(20)]
100000 loops, best of 3: 7.7 µs per loop

As noted in the comments, the behavior of range changed between Python 2 and 3.
In 2, range constructs an entire list populated with the numbers to iterate over, then iterates over the list. Doing this in a tight loop is very expensive.
In 3, range instead constructs a simple object that (as far as I know) consists of only three numbers: the starting number, the step (distance between numbers), and the end number. Using simple math, you can calculate any point along the range without iterating to it. This makes "random access" O(1) instead of the O(n) you get when the entire list is iterated, and it avoids creating a costly list in the first place.
In 2, use xrange to iterate over a range object instead of a list.
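A small illustration of the Python 3 behaviour (my addition): a range never materializes its values, so indexing and membership are computed arithmetically.
r = range(0, 10**12, 7)   # no list of ~1.4e11 numbers is ever built
print(len(r))             # 142857142858
print(r[10**9])           # 7000000000, computed in O(1)
print(7 * 10**5 in r)     # True, also decided by arithmetic for int arguments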
(@Tom: I'll delete this if you post an answer.)

It's hard to see exactly what you need to do because of the missing code, but it's clear that you need to learn how to vectorize your numpy code. This can lead to a 100x speedup.
You can probably get rid of all the inner for-loops and replace them with vectorized operations.
eg. instead of
for i in range(26):
    delta[i][0] = # CALCULATION: matrix read and multiplication; each cell is different
do
delta[:, 0] = # Vectorized form of whatever operation you were going to do.
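For instance, if the initialization multiplies an initial-state probability by an emission probability for the word's first character, the vectorized form might look like this (a sketch with hypothetical initial_probs (26,) and emission (26, 26) arrays):
first_char = ord(tempWord[0]) - ord('a')
delta[:, 0] = initial_probs * emission[:, first_char]   # all 26 states at once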

Related

Performing matrix operation on two large matrices

I have two large matrices (40000*4096) and I would like to compare and match each row of the first matrix to all of the rows of the second matrix; as a result, the output will have size (40000*40000). However, since I need to do this several thousand times, it is wildly time-consuming: about 26k seconds for each iteration, so for 5000 iterations ...
I would be glad if you could give me some smart suggestion. Thank you.
P.S. this is what I did so far for just one iteration (1 of 5000)
def matcher(Antigens, Antibodies, ind):
    temp = np.zeros((Antibodies.shape[0], Antibodies.shape[1]))
    output = np.zeros((Antibodies.shape[0], 1))
    for i in range(len(Antibodies)):
        temp[i] = np.int32(np.equal(Antigens[ind], Antibodies[i]))
        output[i] = np.sum(temp[i])
    return output

output = [matcher(gens, Antibodies) for gens in Antigens]
Okay, I think I understand what your goal is:
Count the number of row matches (antigen vs antibody matrix). Each row of the resulting vector (40,000 x 1) represents the count of exact matches between one antigen row and all of the antibody rows (so values range from 0 to 40,000).
I made some fake data:
import numpy as np
import numba as nb
num_mat = 5 # number of matrices
num_row = 10_000 # number of rows per matrix
num_elm = 4_096 # number of elements per row
dim = (num_mat,num_row,num_elm)
Antigens = np.random.randint(0,256,dim,dtype=np.uint8)
Antibodies = np.random.randint(0,256,dim,dtype=np.uint8)
There's one important point here: I reduced the matrices to the smallest datatype that can represent the data in order to reduce their memory footprint. I'm not sure what your data looks like, but hopefully you can do this as well.
Also, the following code assumes your dimensions look like the fake data:
(number of matrices, rows, elements)
@nb.njit
def match_arr(arr1, arr2):
    for i in range(arr1.shape[0]):  # 4096 vs 4096
        if arr1[i] != arr2[i]:
            return False
    return True

@nb.njit
def match_mat_sum(ag, ab):
    out = np.zeros((ag.shape[0]))  # 40000
    for i in range(ag.shape[0]):
        tmp = 0
        for j in range(ab.shape[0]):
            tmp += match_arr(ag[i], ab[j])
        out[i] = tmp
    return out

@nb.njit(parallel=True)
def match_sets(Antigens, Antibodies):
    out = np.empty((Antigens.shape[0] * Antibodies.shape[0], Antigens.shape[1]))  # 5000 x 40000
    # multiprocessing per antigen matrix; you may want to move this as suits your data
    for i in nb.prange(Antigens.shape[0]):
        for j in range(Antibodies.shape[0]):
            out[j + (5 * i)] = match_mat_sum(Antigens[i], Antibodies[j])  # need to figure out the index to avoid race conditions
    return out
I lean on Numba heavily. One of the key optimizations is not to check the equivalence of entire rows with np.equal() but to write a custom function match_arr() that breaks as soon as it finds a mis-matched element. Hopefully, this lets us skip a ton of comparisons.
Time comparison:
%timeit match_arr(arr1, arr2)
314 ns ± 0.361 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.equal(arr1, arr2)
1.07 µs ± 5.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
match_mat_sum
This function simply calculates the middle step (the 40,000 x 1 vector) that represents the sum of exact matches between two matrices. This step reduces two matrices like: (m x n), (o x n) -> (m)
match_sets()
The last function parallelizes this operation with explicit parallel loops through nb.prange. You might want to move this function to a different loop depending on what your data looks like (like if you have one antigen matrix, but 5000 antibody matrices, you should move prange to the inner loop or you'll not be leveraging parallelization). The fake data assumes some antigen and some antibody matrices.
Another important thing to note here is the indexing on the out array. In order to avoid race conditions, each explicit loop needs to write to a unique space. Again, depending on your data, you'll need to index the proper "place" to put the result.
On a Ryzen 1600 (6-core) with 16 gigs of RAM, using this fake data, I generated a result in 10.2 seconds.
Your data is about 3,200 times larger. Assuming linear scaling, the full set would take approximately 9 hours, assuming you have enough memory.
You could write some kind of batch loader as well, rather than loading 5000 giant matrices directly into memory.
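Such a batch loader could be as simple as a generator that yields one matrix at a time; a sketch (the .npy file format and the antigen_files / antibody_matrix names are assumptions, not part of the original answer):
def antigen_batches(paths):
    """Yield one antigen matrix at a time instead of holding them all in RAM."""
    for path in paths:
        # mmap_mode='r' maps the .npy file lazily; only the pages actually read hit memory
        yield np.load(path, mmap_mode='r')

# hypothetical usage:
# for ag in antigen_batches(antigen_files):
#     counts = match_mat_sum(ag, antibody_matrix)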
This problem can be tackled with a mixture of numpy broadcasting and the numexpr module, which performs operations quickly while minimizing the storage of intermediate values.
import numexpr as ne
# expand arrays dimensions to support broadcasting when doing comparison
Antigens, Antibodies = Antigens[None, :, :], Antibodies[:, None, :]
output = ne.evaluate('sum((Antigens==Antibodies)*1, axis=2)')
# *1 is a hack because numexpr does not currently support sum on bool
This may be faster than your current solution, but for such large arrays it will take a while.
The performance of numexpr for this operation is a bit lackluster, but you can at least use broadcasting inside the loop:
output = np.zeros((Antibodies.shape[0],)*2, dtype=np.int32)
for row, out_row in zip(Antibodies, output):
    (row[None,:] == Antigens).sum(1, out=out_row)

Slow random sample generation without replacement in scipy

I am trying to create a sparse matrix representation of a random hash map h: [n] -> [d] which maps each i to exactly s random locations out of d available locations, and the values at those locations are drawn from some discrete distribution.
:param d: number of bins
:param n: number of items hashed
:param s: sparsity of each column
:param distribution: distribution object.
Here is my attempt:
start_time = time.time()
distribution = scipy.stats.rv_discrete(values=([-1.0, +1.0], [0.5, 0.5]), name='dist')
data = (1.0/sqrt(self._s))*distribution.rvs(size=self._n*self._s)
col = numpy.empty(self._s*self._n)
for i in range(self._n):
    col[i*self._s:(i+1)*self._s] = i
row = numpy.empty(self._s*self._n)
print time.time()-start_time
for i in range(self._n):
    row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)
S = scipy.sparse.csr_matrix((data, (row, col)), shape=(self._d, self._n))
print time.time()-start_time
return S
Now, creating this map for n=500000, s=10, d=1000 takes around 20 s on my decent workstation, and 90% of that time is consumed in generating the row indices. Is there anything I can do to speed this up? Any alternatives? Thanks.
col = numpy.empty(self._s*self._n)
for i in range(self._n):
    col[i*self._s:(i+1)*self._s] = i
looks like something that could be written as one non-looping expression; though it probably isn't a big time consumer
My first guess - but I need to play with this to be sure - is to assign the column index numbers to all rows at once:
col = np.empty((self._s, self._n))
col[:,:] = np.arange(self._n)
col = col.ravel()
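Another non-looping form that reproduces the original pattern directly (each column index repeated self._s times, using the question's attributes) is np.repeat; a sketch:
col = np.repeat(np.arange(self._n), self._s)   # [0]*s, then [1]*s, ..., then [n-1]*s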
Something similar for:
for i in range(self._n):
    row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)
is, I think, picking _s values from _d, _n times. Doing the no-replace along _s, while allowing replacement across _n, could be tricky.
Without running the code myself (with smaller n) I'm stumbling around a bit. Which is the slow part, generating col, row, or the final csr? Iteration on n=500000 is going to be slow.
The matrix will be (1000, 500000), but with (10*500000) nonzero items. So a sparsity of .01. Just for comparison it would be interesting to generate a sparse random matrix of similar size and sparsity
In [5]: %timeit sparse.random(1000, 500000, .01)
1 loop, best of 3: 24.6 s per loop
and the dense random choices:
In [8]: timeit np.random.choice(1000,(10,500000)).shape
10 loops, best of 3: 53 ms per loop
In [9]: np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
Out[9]: (500000, 10)
In [10]: timeit np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
1 loop, best of 3: 12.7 s per loop
So, yes, the large iteration loop is expensive. But given the replacement policy there might not be a way around that. Or is there?
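One possible vectorized workaround (my sketch, untested at this scale) is to draw a random key for every (item, bin) pair and keep the s smallest keys per row; those column positions are s distinct bins without replacement. It trades memory for speed, since the key array is (n, d) - roughly 4 GB of float64 for n=500000, d=1000:
keys = np.random.rand(self._n, self._d)
row = np.argpartition(keys, self._s, axis=1)[:, :self._s].ravel()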
So as first guess, creating row takes half the time, creating the sparse matrix the other half. I'm not surprised. You are using the coo style of input, which requires lexsorting and summing for duplicates when converting to csr. We might be able to gain speed by using the indptr type of input. There won't be duplicates to sum. And since there are consistently 10 nonzero terms per row, generating the indptr values won't be hard. But I can't do that off the top of my head. (oops, that's the transpose).
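A sketch of that idea (mine, untested): since every column (item) has exactly self._s nonzeros, the indptr is an arithmetic progression, and the coo-style lexsort/deduplication step is avoided by building the matrix in csc form first:
indptr = np.arange(0, self._n * self._s + 1, self._s)
indices = row.astype(np.int32)   # bin (row) indices, already grouped per item/column
S = scipy.sparse.csc_matrix((data, indices, indptr), shape=(self._d, self._n)).tocsr()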
random sparse to csr is just a bit slower:
In [11]: %timeit sparse.random(1000, 500000, .01, 'csr')
1 loop, best of 3: 28.3 s per loop

Possible to use numba.guvectorize to emulate parallel forall / prange?

As a user of Python for data analysis and numerical calculations, rather than a real "coder", I had been missing a really low-overhead way of distributing embarrassingly parallel loop calculations on several cores.
As I learned, there used to be the prange construct in Numba, but it was abandoned because of "instability and performance issues".
Playing with the newly open-sourced @guvectorize decorator I found a way to use it for virtually no-overhead emulation of the functionality of the late prange.
I am very happy to have this tool at hand now, thanks to the guys at Continuum Analytics, and I did not find anything on the web explicitly mentioning this use of @guvectorize. Although it may be trivial to people who have been using NumbaPro earlier, I'm posting this for all those fellow non-coders out there (see my answer to this "question").
Consider the example below, where a two-level nested for loop with a core doing some numerical calculation involving two input arrays and a function of the loop indices is executed in four different ways. Each variant is timed with Ipython's %timeit magic:
naive for loop, compiled using numba.jit
forall-like construct using numba.guvectorize, executed in a single thread (target = "cpu")
forall-like construct using numba.guvectorize, executed in as many threads as there are cpu "cores" (in my case hyperthreads) (target = "parallel")
same as 3., however calling the "guvectorized" forall with the sequence of "parallel" loop indices randomly permuted
The last one is done because (in this particular example) the inner loop's range depends on the value of the outer loop's index. I don't know exactly how the dispatching of gufunc calls is organized inside numpy, but it appears as if the randomization of the "parallel" loop indices achieves slightly better load balancing.
On my (slow) machine (1st gen core i5, 2 cores, 4 hyperthreads) I get the timings:
1 loop, best of 3: 8.19 s per loop
1 loop, best of 3: 8.27 s per loop
1 loop, best of 3: 4.6 s per loop
1 loop, best of 3: 3.46 s per loop
Note: I'd be interested if this recipe readily applies to target="gpu" (it should do, but I don't have access to a suitable graphics card right now), and what's the speedup. Please post!
And here's the example:
import numpy as np
from numba import jit, guvectorize, float64, int64
@jit
def naive_for_loop(some_input_array, another_input_array, result):
    for i in range(result.shape[0]):
        for k in range(some_input_array.shape[0] - i):
            result[i] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))

@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()', nopython=True, target='parallel')
def forall_loop_body_parallel(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))

@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()', nopython=True, target='cpu')
def forall_loop_body_cpu(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
arg_size = 20000
input_array_1 = np.random.rand(arg_size)
input_array_2 = np.random.rand(arg_size)
result_array = np.zeros_like(input_array_1)
# do single-threaded naive nested for loop
# reset result_array inside %timeit call
%timeit -r 3 result_array[:] = 0.0; naive_for_loop(input_array_1, input_array_2, result_array)
result_1 = result_array.copy()
# do single-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_cpu(input_array_1, input_array_2, loop_indices, result_array)
result_2 = result_array.copy()
# do multi-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices, result_array)
result_3 = result_array.copy()
# do forall loop (loop indices scrambled for better load balancing)
# reset result_array inside %timeit call
loop_indices_scrambled = np.random.permutation(range(arg_size))
loop_indices_unscrambled = np.argsort(loop_indices_scrambled)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices_scrambled, result_array)
result_4 = result_array[loop_indices_unscrambled].copy()
# check validity
print(np.all(result_1 == result_2))
print(np.all(result_1 == result_3))
print(np.all(result_1 == result_4))

how to speed up nested loops in Python

I am trying to write some nested loops in my algorithm, and I am running into the problem that the whole algorithm takes too long because of these nested loops. I am quite new to Python (as you may see from my unprofessional code below :( ) and I hope someone can guide me to a way to speed up my code!
The whole algorithm is for fire detection in multiple 1500*6400 arrays. A small contextual analysis is applied while going through the whole array. The contextual analysis is performed with a dynamically assigned window size: the window can grow from 11*11 to 31*31 until there are enough valid values inside the sampling window for the next round of calculation, for example like below:
def ContextualWindows(arrb4, arrb5, pfire):
    #### arrb4, arrb5, pfire are 31*31 sampling windows from a large 1500*6400 numpy array
    i = 5
    while i in range(5, 16):
        arrb4back = arrb4[15-i:16+i, 15-i:16+i]
        ## only output the array data when it is 'large' enough
        ## to have enough good quality data to do calculation
        if np.ma.count(arrb4back) >= min(10, 0.25*i*i):
            arrb5back = arrb5[15-i:16+i, 15-i:16+i]
            pfireback = pfire[15-i:16+i, 15-i:16+i]
            canfire = 0
            i = 20
        else:
            i = i + 1
    ### unknown pixel: background condition could not be characterized
    if i != 20:
        canfire = 1
        arrb5back = arrb5
        pfireback = pfire
        arrb4back = arrb4
    return (arrb4back, arrb5back, pfireback, canfire)
Then this dynamic window will be fed into the next round of tests, for example:
b4backave=np.mean(arrb4Windows)
b4backdev=np.std(arrb4Windows)
if b4>b4backave+3.5*b4backdev:
    firetest = True
Running the whole code on my multiple 1500*6400 numpy arrays took over half an hour, or even longer. I am just wondering if anyone has an idea how to deal with it? A general idea of which part I should put my effort into would be greatly helpful!
Many thanks!
Avoid while loops if speed is a concern. The loop lends itself to a for loop as start and end are fixed. Additionally, your code does a lot of copying which isn't really necessary. The rewritten function:
def ContextualWindows(arrb4, arrb5, pfire):
    ''' arrb4, arrb5, pfire are 31*31 sampling windows from
        a large 1500*6400 numpy array '''
    for i in range(5, 16):
        lo = 15 - i  # 10..0
        hi = 16 + i  # 21..31
        # only output the array data when it is 'large' enough
        # to have enough good quality data to do calculation
        if np.ma.count(arrb4[lo:hi, lo:hi]) >= min(10, 0.25*i*i):
            return (arrb4[lo:hi, lo:hi], arrb5[lo:hi, lo:hi], pfire[lo:hi, lo:hi], 0)
    else:  # unknown pixel: background condition could not be characterized
        return (arrb4, arrb5, pfire, 1)
For clarity I've used style guidelines from PEP 8 (like extended comments, number of comment chars, spaces around operators etc.). Copying of a windowed arrb4 occurs twice here, but only if the condition is fulfilled, and this will happen only once per function call. The else clause will be executed only if the for-loop has run to its end. We don't even need a break from the loop as we exit the function altogether.
Let us know if that speeds up the code a bit. I don't think it'll be much but then again there isn't much code anyway.
I've run some time tests with ContextualWindows and variants. One i step takes about 50 µs, all ten about 500 µs.
This simple iteration takes about the same time:
[np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
The iteration mechanism, and the 'copying' arrays are minor parts of the time. Where possible numpy is making views, not copies.
I'd focus on either minimizing the number of these count steps, or speeding up the count.
Comparing times for various operations on these windows:
First time for 1 step:
In [167]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,6)]
10000 loops, best of 3: 43.9 us per loop
now for the 10 steps:
In [139]: timeit [arrb4[15-i:16+i,15-i:16+i].shape for i in range(5,16)]
10000 loops, best of 3: 33.7 us per loop
In [140]: timeit [np.sum(arrb4[15-i:16+i,15-i:16+i]>500) for i in range(5,16)]
1000 loops, best of 3: 390 us per loop
In [141]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
1000 loops, best of 3: 464 us per loop
Simply indexing does not take much time, but testing for conditions takes substantially more.
cumsum is sometimes used to speed up sums over sliding windows. Instead of taking sum (or mean) over each window, you calculate the cumsum and then use the differences between the front and end of window.
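In 1d the idea looks like this (my illustration, not from the original answer):
x = np.arange(10.0)
w = 4                                   # window length
c = np.concatenate(([0.0], np.cumsum(x)))
window_sums = c[w:] - c[:-w]            # window_sums[i] == x[i:i+w].sum(), no inner loop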
Trying something like that, but in 2d - cumsum in both dimensions, followed by differences between diagonally opposite corners:
In [164]: %%timeit
.....: cA4=np.cumsum(np.cumsum(arrb4,0),1)
.....: [cA4[15-i,15-i]-cA4[15+i,15+i] for i in range(5,16)]
.....:
10000 loops, best of 3: 43.1 us per loop
This is almost 10x faster than the (nearly) equivalent sum. The values don't quite match, but the timing suggests that this may be worth refining.
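For reference, the exact window sum from a 2d cumulative sum needs all four corners plus a row/column of zero padding; a sketch of that refinement (mine; it sums raw values, so the masked-count logic from the question would still need separate handling, e.g. a cumsum over the mask):
cA4 = np.zeros((arrb4.shape[0] + 1, arrb4.shape[1] + 1))
cA4[1:, 1:] = np.cumsum(np.cumsum(arrb4, 0), 1)

def window_sum(c, r0, r1, c0, c1):
    # sum of arr[r0:r1, c0:c1] from the zero-padded 2d cumsum c
    return c[r1, c1] - c[r0, c1] - c[r1, c0] + c[r0, c0]

# sums of the 11x11 .. 31x31 windows centred at (15, 15)
sums = [window_sum(cA4, 15 - i, 16 + i, 15 - i, 16 + i) for i in range(5, 16)]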

How to insert sort through swap or through pop() - insert()?

I've made my own version of insertion sort that uses pop and insert - to pick out the current element and insert it before the smallest element larger than the current one - rather than the standard swapping backwards until a larger element is found. When I run the two on my computer, mine is about 3.5 times faster. When we did it in class, however, mine was much slower, which is really confusing. Anyway, here are the two functions:
def insertionSort(alist):
    for x in range(len(alist)):
        for y in range(x, 0, -1):
            if alist[y] < alist[y-1]:
                alist[y], alist[y-1] = alist[y-1], alist[y]
            else:
                break

def myInsertionSort(alist):
    for x in range(len(alist)):
        for y in range(x):
            if alist[y] > alist[x]:
                alist.insert(y, alist.pop(x))
                break
Which one should be faster? Does alist.insert(y,alist.pop(x)) change the size of the list back and forth, and how does that affect time efficiency?
Here's my quite primitive test of the two functions:
from time import time
from random import shuffle

listOfLists = []
for x in range(100):
    a = list(range(1000))
    shuffle(a)
    listOfLists.append(a)

start = time()
for i in listOfLists:
    myInsertionSort(i[:])
myInsertionTime = time() - start

start = time()
for i in listOfLists:
    insertionSort(i[:])
insertionTime = time() - start

print("regular:", insertionTime)
print("my:", myInsertionTime)
I had underestimated your question, but it actually isn't easy to answer. There are a lot of different elements to consider.
Doing lst.insert(y, lst.pop(x)) is an O(n) operation, because lst.pop(x) costs O(len(lst) - x): list elements must be contiguous, so the list has to shift all the elements after index x left by one; dually, lst.insert(y, _) costs O(len(lst) - y) since it has to shift all the elements right by one.
This means that a naive analysis gives an upper bound of O(n^3) complexity in the worst case for your code. As you suggested this is actually correct [remember that O(n^2) is a subset of O(n^3)], however it's not a tight upper bound because you move each element only once. So n times you do O(n) work, and the complexity is indeed O(n*n + n^2) = O(n^2), where the second n^2 refers to the number of comparisons, which is n^2 in the worst case. So, asymptotically your solution is the same as insertion sort.
The first and the second algorithm iterate over y in opposite directions. As I have already commented, this changes the worst case for each algorithm.
While insertion sort has its worst-case with reverse-sorted sequences, your algorithm doesn't (which is actually good). This might be a factor that adds to the difference in timings since if you do not use random lists you might use an input that is worst-case for one algorithm but not worst-case for the other.
In [2]: %timeit insertionSort(list(range(10)))
100000 loops, best of 3: 5.46 us per loop
In [3]: %timeit myInsertionSort(list(range(10)))
100000 loops, best of 3: 8.47 us per loop
In [4]: %timeit insertionSort(list(reversed(range(10))))
10000 loops, best of 3: 20.4 us per loop
In [5]: %timeit myInsertionSort(list(reversed(range(10))))
100000 loops, best of 3: 9.81 us per loop
You should always test (also) with random inputs of different lengths.
The average complexity of insertion sort is O(n^2). Your algorithm might have a lower average time, however it's not entirely trivial to compute it.
I don't get why you use the insert+pop at all when you can use the swap. Trying this on my machine yields quite a big improvement in efficiency, since you reduce an O(n^2) component to an O(n) component.
Now, you ask why there was such a big change between the execution at home and in class.
There can be various reasons; for example, if you did not use a randomly generated list, you might have used an almost best-case input for insertion sort while it was an almost worst-case input for your algorithm, and similar considerations. Without seeing what you did in class it is not possible to give an exact answer.
However I believe there is a very simple answer: you forgot to copy the list before profiling. This is the same error I did when I first posted this answer (quote from the previous answer):
If you want to compare the two functions you should use random lists:
In [6]: import random
...: input_list = list(range(10))
...: random.shuffle(input_list)
...:
In [7]: %timeit insertionSort(input_list) # Note: no input_list[:]!! Argh!
100000 loops, best of 3: 4.82 us per loop
In [8]: %timeit myInsertionSort(input_list)
100000 loops, best of 3: 7.71 us per loop
Also you should use big inputs to see the difference clearly:
In [11]: input_list = list(range(1000))
...: random.shuffle(input_list)
In [12]: %timeit insertionSort(input_list) # Note: no input_list[:]! Argh!
1000 loops, best of 3: 508 us per loop
In [13]: %timeit myInsertionSort(input_list)
10 loops, best of 3: 55.7 ms per loop
Note also that I, unfortunately, always executed the pairs of profilings in the same order, confirming my previous ideas.
As you can see all calls to insertionSort except the first one used a sorted list as input, which is the best case for insertion sort! This means that the timing for insertion sort is wrong (and I'm sorry for having written this before!). Meanwhile, myInsertionSort was always executed with an already sorted list, and guess what? It turns out that one of the worst cases for myInsertionSort is the sorted list!
Think about it:
for x in range(len(alist)):
    for y in range(x):
        if alist[y] > alist[x]:
            alist.insert(y, alist.pop(x))
            break
If you have a sorted list, the alist[y] > alist[x] comparison will always be false. You might say "perfect! no swaps => no O(n) work => better timing"; unfortunately this is false, because no swaps also means no break, and hence you are doing n*(n+1)/2 iterations, i.e. the worst-case performance.
Note that this is very bad!!! Real-world data really often is partially sorted, so an algorithm whose worst-case is the sorted list is usually not a good algorithm for real-world use.
Note that this does not change if you replace insert + pop with a simple swap, hence the algorithm itself is not good from this point of view, independently from the implementation.
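One way to sidestep that particular worst case (my sketch, not from the original answer) is to skip elements that are already in place relative to the sorted prefix; a sorted input then costs only n-1 comparisons:
def myInsertionSort2(alist):
    for x in range(1, len(alist)):
        if alist[x - 1] <= alist[x]:   # already in place w.r.t. the sorted prefix
            continue
        for y in range(x):
            if alist[y] > alist[x]:
                alist.insert(y, alist.pop(x))
                break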
