I tried to write a quicksort in Python (for learning algorithms), but I found it about 10x slower than the native sort. Here's the result:
16384 numbers:
native: 5.556 ms
quicksort: 96.412 ms
65536 numbers:
native: 27.190 ms
quicksort: 436.110 ms
262144 numbers:
native: 151.820 ms
quicksort: 1975.943 ms
1048576 numbers:
native: 792.091 ms
quicksort: 9097.085 ms
4194304 numbers:
native: 3979.032 ms
quicksort: 39106.887 ms
Does that mean there's something wrong with my implementation?
Or is it OK, because the native sort uses a lot of low-level optimization?
Even so, it feels unacceptable for sorting a million numbers to take nearly 10 s, even though I wrote it just for learning rather than for practical use. And my computer is quite fast.
Here's my code:
import random

def quicksort(lst):
    quicksortinner(lst, 0, len(lst) - 1)

def quicksortinner(lst, start, end):
    if start >= end:
        return
    j = partition(lst, start, end)
    quicksortinner(lst, start, j - 1)
    quicksortinner(lst, j + 1, end)

def partition(lst, start, end):
    pivotindex = random.randrange(start, end + 1)
    swap(lst, pivotindex, end)
    pivot = lst[end]
    i, j = start, end - 1
    while True:
        while lst[i] <= pivot and i <= end - 1:
            i += 1
        while lst[j] >= pivot and j >= start:
            j -= 1
        if i >= j:
            break
        swap(lst, i, j)
    swap(lst, i, end)
    return i

def swap(lst, a, b):
    if a == b:
        return
    lst[a], lst[b] = lst[b], lst[a]
In partition, i scans right and j scans left (the approach from Algorithms). Earlier I tried the version where both indices move right (maybe more common), and there wasn't much difference.
The native sort is written in C. Your quicksort is written in pure Python. A speed difference of 10x is expected. If you run your code using PyPy, you should get closer to native speed (PyPy uses a tracing JIT to achieve high performance). Likewise, Cython would give a nice speed boost as well (Cython is a Python-to-C compiler).
A way to tell if your algorithm is even in the same ballpark is to count the number of comparisons used by both sort algorithms. In finely tuned code, the comparison costs dominate the running time. Here's a tool for counting comparisons:
class CountCmps(float):
    def __lt__(self, other):
        global cnt
        cnt += 1
        return float.__lt__(self, other)
>>> from random import random
>>> data = [CountCmps(random()) for i in range(10000)]
>>> cnt = 0
>>> data.sort()
>>> cnt
119883
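Note that list.sort() only ever calls __lt__, while your partition() compares with <= and >=, so to count your quicksort's comparisons the wrapper also needs __le__ and __ge__. Here is a sketch (CountCmps2 is my own name, it assumes your quicksort is defined in the same session, and the count will vary from run to run because of the random pivots):
from random import random

class CountCmps2(float):
    def __lt__(self, other):
        global cnt
        cnt += 1
        return float.__lt__(self, other)
    def __le__(self, other):
        global cnt
        cnt += 1
        return float.__le__(self, other)
    def __ge__(self, other):
        global cnt
        cnt += 1
        return float.__ge__(self, other)

data = [CountCmps2(random()) for i in range(10000)]
cnt = 0
quicksort(data)
print(cnt)  # compare this with the count list.sort() needed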
One other factor is that your call to random.randrange() goes through many pure Python steps and does more work than you might expect, so it will be a non-trivial component of the total run time. Because random pivot selection can be slow, consider using a median-of-three technique for selecting the pivot.
Also, the call to the swap() function isn't fast in CPython. Inlining that code should give you a speed boost.
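For instance, here is a sketch of partition() with the swap() calls inlined (my edit of your code, untested against your benchmark, so treat it as illustrative):
import random

def partition(lst, start, end):
    pivotindex = random.randrange(start, end + 1)
    # inlined swap; swapping an index with itself is harmless, so the a == b guard isn't needed
    lst[pivotindex], lst[end] = lst[end], lst[pivotindex]
    pivot = lst[end]
    i, j = start, end - 1
    while True:
        while lst[i] <= pivot and i <= end - 1:
            i += 1
        while lst[j] >= pivot and j >= start:
            j -= 1
        if i >= j:
            break
        lst[i], lst[j] = lst[j], lst[i]      # inlined swap
    lst[i], lst[end] = lst[end], lst[i]      # inlined swap
    return i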
As you can see, there is a lot more to optimizing Python than just selecting a good algorithm. Hope this answer gets you further to your goal :-)
You will gain a small speed-up by moving from recursion to iteration, although the bulk of the difference is down to the native code being very fast.
I illustrate this with reference to MergeSort. Apologies for not using QuickSort - they run at about the same speed, but MergeSort takes a little less time to wrap your head around, and the iterative version is easier to demonstrate.
Essentially, MergeSort sorts a list by breaking it in half, sorting the two halves separately (using itself, of course!), and combining the results - sorted lists can be merged in O(n) time, so this gives overall O(n log n) performance.
Here is a simple recursive MergeSort algorithm:
def mergeSort(theList):
    if len(theList) <= 1:
        return theList
    theLength = len(theList) // 2
    return mergeSorted( mergeSort(theList[0:theLength]), mergeSort(theList[theLength:]) )

def mergeSorted(theList1, theList2):
    sortedList = []
    counter1 = 0
    counter2 = 0
    while True:
        if counter1 == len(theList1):
            return sortedList + theList2[counter2:]
        if counter2 == len(theList2):
            return sortedList + theList1[counter1:]
        if theList1[counter1] < theList2[counter2]:
            sortedList.append(theList1[counter1])
            counter1 += 1
        else:
            sortedList.append(theList2[counter2])
            counter2 += 1
Exactly as you found, this is beaten into the ground by the in-built sorting algorithm:
import timeit
setup = """from __main__ import mergeSort
import random
theList = [random.random() for x in xrange(1000)]"""
timeit.timeit('theSortedList1 = sorted(theList)', setup=setup, number=1000)
#0.33633776246006164
timeit.timeit('theSortedList1 = mergeSort(theList)', setup=setup, number=1000)
#8.415547955717784
However, a bit of a time boost can be had by eliminating the recursive function calls in the mergeSort function (this also avoids the dangers of hitting recursion limits). This is done by starting at the base elements, and combining them pairwise, a bottom-up approach instead of a top-down approach. For example:
def mergeSortIterative(theList):
    # Python 2: map() and zip() return lists here
    theNewList = map(lambda x: [x], theList)
    theLength = 1
    while theLength < len(theList):
        theNewNewList = []
        pairs = zip(theNewList[::2], theNewList[1::2])
        for pair in pairs:
            theNewNewList.append( mergeSorted( pair[0], pair[1] ) )
        if len(pairs) * 2 < len(theNewList):
            theNewNewList.append(theNewList[-1])
        theLength *= 2
        theNewList = theNewNewList
    return theNewList[0]
Now the growing sorted sub-lists are stored at each iteration, and the recursive function calls are eliminated. Running this gives about a 15% speed boost in my run time - and this was a quickly-thrown-together version.
setup = """from __main__ import mergeSortIterative
import random
theList = [random.random() for x in xrange(1000)]"""
timeit.timeit('theSortedList1 = mergeSortIterative(theList)', setup=setup, number=1000)
#7.1798827493580575
So I'm still nowhere near the built-in version, but a little better than I was doing before.
A recipe for iterative QuickSort can be found here.
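For reference, here is a rough sketch of what an explicit-stack quicksort can look like (my own illustration using a simple Lomuto partition, not the linked recipe): the to-do sub-ranges live on a Python list instead of the call stack.
def quickSortIterative(lst):
    stack = [(0, len(lst) - 1)]
    while stack:
        start, end = stack.pop()
        if start >= end:
            continue
        # Lomuto partition around the last element
        pivot = lst[end]
        i = start
        for j in range(start, end):
            if lst[j] < pivot:
                lst[i], lst[j] = lst[j], lst[i]
                i += 1
        lst[i], lst[end] = lst[end], lst[i]
        # push the two remaining sub-ranges
        stack.append((start, i - 1))
        stack.append((i + 1, end))
    return lst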
While looking at the discussion for Product of Array Except Self I came to try two different techniques, and I cannot wrap my head around this. Why is modifying a list at the beginning and end in one iteration (Solution 1) slower than modifying a list from the beginning and then in reverse using two for loops (Solution 2)?
Where my confusion comes in:
Initializing the list to all 1's is O(n) from my understanding (this is the same in both algos so it shouldn't make a difference)
Setting/getting an item in a list is O(1)
Looping through a list is O(n) each time
Therefore Solution 1 should be :
O(n) for init list
O(n) for the iteration
a few O(1)s for getting/setting elements in the list
= O(n + n) = O(n)
Solution 2 would then be:
O(n) for init list
O(n) for the first iteration
O(n) for the second iteration
a few O(1)s for getting/setting elements in the list
= O(n + n + n) = O(n)
So they are both technically O(n), but Solution 2 still has a second iteration and yet somehow it runs FASTER! Not the same, FASTER!
I also know that Leetcode isn't the best at judging runtime speed since it varies greatly, but I have run this so many times and it always shows Solution 2 running faster. I don't understand what I'm missing here.
Solution 1
ans = [1] * len(nums)
left = 1
right = 1
for i in range(len(nums)):
    ans[i] *= left
    ans[-1-i] *= right
    left *= nums[i]
    right *= nums[-1-i]
return ans
Solution 2
prod = 1
ans = [1]*len(nums)
for x in range(0,len(nums)):
    ans[x] *= prod
    prod *= nums[x]
prod = 1
for x in range(len(nums)-1, -1, -1):
    ans[x] *= prod
    prod *= nums[x]
return ans
Is it faster?
It's not necessarily faster. You didn't specify how you got your results. I tested it in CPython and PyPy and while solution 2 wins in CPython, solution 1 wins in PyPy. (See below for the test code.)
$ python --version
Python 3.7.6
$ python test.py
Solution 1 time: 2.9916164000001118
Solution 2 time: 2.6632864999996855
$ python test.py
Solution 1 time: 2.857291400000122
Solution 2 time: 2.854712400000153
$ python test.py
Solution 1 time: 2.7937206999999944
Solution 2 time: 2.5544856999999865
$ pypy3 --version
Python 3.6.12 (7.3.3+dfsg-3~ppa1~ubuntu18.04, Feb 25 2021, 20:14:47)
$ pypy3 test.py
Solution 1 time: 0.07995370000026014
Solution 2 time: 0.09105890000000727
$ pypy3 test.py
Solution 1 time: 0.07695659999990312
Solution 2 time: 0.08727580000004309
$ pypy3 test.py
Solution 1 time: 0.07859189999999217
Solution 2 time: 0.09762659999978496
Why might solution 2 be faster in CPython?
Well, notice also that CPython is vastly slower than PyPy. Regular CPython is interpreted. Interpreting Python code is very, very slow compared to running compiled code. The Python interpreter loops, executing bytecode operations. Every opcode in the loop must be interpreted every time. However, the infrastructure of the for loop itself, that is, the code that invokes the iterator and checks if it should continue looping, and in particular the range iterator itself are not executing any interpreted Python code at all. They are all pre-compiled native code. They have negligible cost compared to the execution of your Python code instructions inside the loop. And solution 1 does more work inside the loop. Not a lot more, but it has to do two extra subtractions every time.
In contrast, in PyPy, you can expect that everything gets compiled to native code. This makes your Python code in the body of the loop much faster. So much faster that it's now comparable in cost to the code implementing the range iterator. At this point the fact that you're not applying the iterator twice does become a big enough deal to let solution 1 win out.
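If you want to see that extra per-iteration work for yourself, you can disassemble the two functions (using the s1/s2 definitions from the test code below); the exact opcodes vary between CPython versions, so treat this as an inspection aid rather than proof:
import dis

dis.dis(s1)  # the loop body also has to evaluate the -1-i indices on every pass
dis.dis(s2)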
A word of caution
All that said, I might be wrong! It's really hard to know for sure how code will perform, and I've not done the exhaustive dissection required to be certain here - I've just come up with a plausible explanation. It's also possible that one algorithm is less cache-friendly than the other, or pipelines better in the CPU. I just don't think that kind of thing is likely to make much difference in interpreted CPython.
I did find that if you change solution 1 to remove the subtractions, it runs in about the same time as solution 2 in CPython. Of course, it then gets the wrong answer, but that does tend to make me think that my explanation is a reasonable one.
An aside on time complexity
You linked to a Leetcode problem statement which constrains the input so that it is guaranteed that every prefix and every suffix of nums has a product that fits in a 32-bit integer. (I assume they are signed integers.) This is very constraining on the size and values in nums. Unless there are 0s in nums (which make the problem trivial) then nums can't have more than about 31 values that aren't 1 or -1. This is why I picked a test array with 20 integers in the range [1..7]. However, this is very small to be talking about asymptotic complexity. Asymptotic complexity tells you something about how an algorithm behaves when N becomes "large enough" that you can ignore the constant time factors. If N has a small upper limit you might never reach "large enough".
Even where algorithmic complexity is a useful tool, it still can't tell you which of two O(N) algorithms will be faster. All it can tell you is that if two algorithms have different time complexity, then there is some N above which the algorithm with lower complexity will consistently be faster.
Test code
import timeit
import random

def s1(nums):
    ans = [1] * len(nums)
    left = 1
    right = 1
    for i in range(len(nums)):
        ans[i] *= left
        ans[-1-i] *= right
        left *= nums[i]
        right *= nums[-1-i]
    return ans

def s2(nums):
    prod = 1
    ans = [1]*len(nums)
    for x in range(0,len(nums)):
        ans[x] *= prod
        prod *= nums[x]
    prod = 1
    for x in range(len(nums)-1, -1, -1):
        ans[x] *= prod
        prod *= nums[x]
    return ans

def main():
    r = random.Random(12345)
    nums = [r.randint(1,7) for i in range(20)]
    print('Solution 1 time:', timeit.timeit(lambda: s1(nums), number=500000))
    print('Solution 2 time:', timeit.timeit(lambda: s2(nums), number=500000))

if __name__ == '__main__':
    main()
The runtime of the code below is really long. Is there a more efficient way of calculating the sum of all prime numbers under 2 million?
primeNumberList = []
previousNumberList = []
for i in range(2,2000000):
    for x in range(2,i):
        previousNumberList.append(x)
    if all(i % n > 0 for n in previousNumberList):
        primeNumberList.append(i)
    previousNumberList = []
print(sum(primeNumberList))
You can optimize it in a bunch of interesting ways.
First, look at algorithmic optimizations.
Use algorithms that find prime numbers faster. (See here).
Use something like memoization to prevent unnecessary computation.
If memory is not an issue, figure out how to exchange memory for runtime.
Next, look at systems level optimizations.
Divide it over multiple processes (multiple threads won't add much easily due to Python's Global Interpreter Lock). You can do this using gRPC on one host, or PySpark etc. if you are using multiple hosts; a minimal single-host sketch with the standard multiprocessing module follows below.
Finally, look at stuff like loop unrolling etc.
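To illustrate the multi-process point above, here is a minimal sketch (my own code, not a tuned solution): each worker sums the primes in its own sub-range using a deliberately simple trial-division test, and the chunk size of 250000 is an arbitrary choice.
from multiprocessing import Pool

def is_prime(n):
    # simple trial division; the point of the sketch is the process pool, not the test
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def sum_primes_in_range(bounds):
    lo, hi = bounds
    return sum(n for n in range(lo, hi) if is_prime(n))

if __name__ == '__main__':
    N = 2000000
    step = 250000
    chunks = [(lo, min(lo + step, N)) for lo in range(2, N, step)]
    with Pool() as pool:
        print(sum(pool.map(sum_primes_in_range, chunks)))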
Good luck!
Start with a faster algorithm for calculating prime numbers. A really good survey is here: Fastest way to list all primes below N
This one (taken from one of the answers of that post) calculates in under a second on my year-old iMac:
def primes(n):
    """ Returns a list of primes < n """
    sieve = [True] * n
    for i in range(3, int(n**0.5)+1, 2):
        if sieve[i]:
            sieve[i*i::2*i] = [False]*((n-i*i-1)//(2*i)+1)
    return [2] + [i for i in range(3,n,2) if sieve[i]]
print(sum(primes(20000000)))
As long as you have the memory space for it, the sieve of Eratosthenes is hard to beat when it comes to finding prime numbers:
def sumPrimes(N):
    prime = [True]*(N+1)
    for n in range(3, int(N**0.5 + 1), 2):
        if prime[n]:
            prime[n*n:N+1:n*2] = [False]*len(range(n*n, N+1, n*2))
    return sum(n for n in range(3, N+1, 2) if prime[n]) + 2*(N > 1)
If I have a list that is already sorted and use the in keyword, for example:
a = [1,2,5,6,8,9,10]
print 8 in a
I think this should do a sequential search but can I make it faster by doing binary search?
Is there a pythonic way to search in a sorted list?
The standard library has the bisect module which supports searching in sorted sequences.
However, for small lists, I would bet that the C implementation behind the in operator would beat out bisect. You'd have to measure with a bunch of common cases to determine the real break-even point on your target hardware...
It's worth noting that if you can get away with an unordered iterable (i.e. a set), then you can do the lookup in O(1) time on average (using the in operator), compared to bisection on a sequence which is O(logN) and the in operator on a sequence which is O(N). And, with a set you also avoid the cost of sorting it in the first place :-).
There is a binary search for Python in the standard library, in module bisect. It does not support in/contains as is, but you can write a small function to handle it:
from bisect import bisect_left

def contains(a, x):
    """returns true if sorted sequence `a` contains `x`"""
    i = bisect_left(a, x)
    return i != len(a) and a[i] == x
Then
>>> contains([1,2,3], 3)
True
>>> contains([1,2,3], 4)
False
This is not going to beat a plain sequential in on small lists, though: even though bisect has had an optional C acceleration in CPython since Python 2.4, the constant factors mean you'd probably find sequential in faster in quite a lot of cases.
It is hard to time the exact break-even point in CPython. This is because the code is written in C; if you check for a value that is greater than or less than any value in the sequence, then the CPU's branch prediction will play tricks on you, and you get:
In [2]: a = list(range(100))
In [3]: %timeit contains(a, 101)
The slowest run took 8.09 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 370 ns per loop
Here, the best of 3 is not representative of the true running time of the algorithm.
But tweaking tests, I've reached the conclusion that bisecting might be faster than in for lists having as few as 30 elements.
However, if you're doing really many in operations you ought to use a set; you can convert the list into a set once (it does not even need to be sorted) and the in operation will be asymptotically faster than any binary search ever would be:
>>> a = [10, 6, 8, 1, 2, 5, 9]
>>> a_set = set(a)
>>> 10 in a_set
True
On the other hand, sorting a list has greater time-complexity than building a set, so most of the time a set would be the way to go.
I would go with this pure one-liner (providing bisect is imported):
a and a[bisect.bisect_right(a, x) - 1] == x
Stress test:
import bisect
from random import randrange

def contains(a, x):
    return a and a[bisect.bisect_right(a, x) - 1] == x

for _ in range(10000):
    a = sorted(randrange(10) for _ in range(10))
    x = randrange(-5, 15)
    assert (x in a) == contains(a, x), f"Error for {x} in {a}"
... doesn't print anything.
I created this program for an assignment in which we were required to create an implementation of Quichesort. This is a hybrid sorting algorithm that uses Quicksort until it reaches a certain recursion depth (log2(N), where N is the length of the list), then switches to Heapsort, to avoid exceeding the maximum recursion depth.
While testing my implementation, I discovered that although it generally performed better than regular Quicksort, Heapsort consistently outperformed both. Can anyone explain why Heapsort performs better, and under what circumstances Quichesort would be better than both Quicksort and Heapsort?
Note that for some reason, the assignment referred to the algorithm as "Quipsort".
Edit: Apparently, "Quichesort" is actually identical to
Introsort.
I also noticed that a logic error in my medianOf3() function was
causing it to return the wrong value for certain inputs. Here is an improved
version of the function:
def medianOf3(lst):
    """
    From a lst of unordered data, find and return the median value from
    the first, middle and last values.
    """
    first, last = lst[0], lst[-1]
    if len(lst) <= 2:
        return min(first, last)
    middle = lst[(len(lst) - 1) // 2]
    return sorted((first, middle, last))[1]
Would this explain the algorithm's relatively poor performance?
Code for Quichesort:
import heapSort  # heapSort
import math      # log2 (for quicksort depth limit)

def medianOf3(lst):
    """
    From a lst of unordered data, find and return the median value from
    the first, middle and last values.
    """
    # NOTE: this is the original version with the logic error described in the edit above
    first, last = lst[0], lst[-1]
    if len(lst) <= 2:
        return min(first, last)
    median = lst[len(lst) // 2]
    return max(min(first, median), min(median, last))

def partition(pivot, lst):
    """
    partition: pivot (element in lst) * List(lst) ->
    tuple(List(less), List(same), List(more)).
    Where:
        List(less) has values less than the pivot
        List(same) has pivot value/s, and
        List(more) has values greater than the pivot
    e.g. partition(5, [11,4,7,2,5,9,3]) == [4,2,3], [5], [11,7,9]
    """
    less, same, more = [], [], []
    for val in lst:
        if val < pivot:
            less.append(val)
        elif val > pivot:
            more.append(val)
        else:
            same.append(val)
    return less, same, more

def quipSortRec(lst, limit):
    """
    A non in-place, depth limited quickSort, using median-of-3 pivot.
    Once the limit drops to 0, it uses heapSort instead.
    """
    if lst == []:
        return []
    if limit == 0:
        return heapSort.heapSort(lst)
    limit -= 1
    pivot = medianOf3(lst)
    less, same, more = partition(pivot, lst)
    return quipSortRec(less, limit) + same + quipSortRec(more, limit)

def quipSort(lst):
    """
    The main routine called to do the sort. It should call the
    recursive routine with the correct values in order to perform
    the sort.
    """
    depthLim = int(math.log2(len(lst)))
    return quipSortRec(lst, depthLim)
Code for Heapsort:
import heapq  # mkHeap (for adding/removing from heap)

def heapSort(lst):
    """
    heapSort(List(Orderable)) -> List(Ordered)
    performs a heapsort on 'lst' returning a new sorted list
    Postcondition: the argument lst is not modified
    """
    heap = list(lst)
    heapq.heapify(heap)
    result = []
    while len(heap) > 0:
        result.append(heapq.heappop(heap))
    return result
The basic facts are as follows:
Heapsort has worst-case O(n log(n)) performance, but tends to be slower than quicksort in practice.
Quicksort has O(n log(n)) performance on average and O(n^2) in the worst case, but is fast in practice.
Introsort is intended to harness the fast-in-practice performance of quicksort, while still guaranteeing the worst-case O(n log(n)) behavior of heapsort.
One question to ask is, why is quicksort faster "in practice" than heapsort? This is a tough one to answer, but most answers point to how quicksort has better spatial locality, leading to fewer cache misses. However, I'm not sure how applicable this is to Python, as it is running in an interpreter and has a lot more junk going on under the hood than other languages (e.g. C) that could interfere with cache performance.
As to why your particular introsort implementation is slower than Python's heapsort - again, this is difficult to determine. One thing to note is that in CPython the heapq functions you call (heapify() and heappop()) are C-accelerated via the _heapq module, so your heapSort is not on an entirely even footing with the pure-Python quicksort layer wrapped around it. It may also be that creating and concatenating many smaller lists is costly, so you could try rewriting your quicksort to act in-place and see if that helps. You could also try tweaking various aspects of the implementation to see how that affects performance, or run the code through a profiler and see if there are any hot spots. But in the end I think it's unlikely you'll find a definite answer. It may just boil down to which operations are particularly fast or slow in the Python interpreter.
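If you want to experiment with the in-place idea, here is a rough sketch of what a depth-limited, in-place variant could look like (my own code, not the poster's: it calls heapq directly rather than the heapSort module, and uses a plain Lomuto partition instead of median-of-3):
import heapq

def _heapsortSlice(lst, lo, hi):
    # sort lst[lo:hi] in place by pushing the slice through a heap
    heap = lst[lo:hi]
    heapq.heapify(heap)
    for k in range(lo, hi):
        lst[k] = heapq.heappop(heap)

def quipSortInPlace(lst, lo=0, hi=None, limit=None):
    if hi is None:
        hi = len(lst)
    if limit is None:
        limit = max(1, (hi - lo).bit_length())  # roughly log2(n)
    while hi - lo > 1:
        if limit == 0:
            _heapsortSlice(lst, lo, hi)
            return
        limit -= 1
        # Lomuto partition around the last element as pivot
        pivot = lst[hi - 1]
        i = lo
        for j in range(lo, hi - 1):
            if lst[j] < pivot:
                lst[i], lst[j] = lst[j], lst[i]
                i += 1
        lst[i], lst[hi - 1] = lst[hi - 1], lst[i]
        # recurse on the smaller side, loop on the larger one
        if i - lo < hi - (i + 1):
            quipSortInPlace(lst, lo, i, limit)
            lo = i + 1
        else:
            quipSortInPlace(lst, i + 1, hi, limit)
            hi = i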
Does using Higher Order Functions & Lambdas make running time & memory efficiency better or worse?
For example, to multiply all the numbers in a list:
nums = [1,2,3,4,5]
prod = 1
for n in nums:
prod*=n
vs
prod2 = reduce(lambda x,y:x*y , nums)
Does the HOF version have any advantage over the loop version other than fewer lines of code and a functional approach?
EDIT:
I am not able to add this as an answer as I don't have the required reputation.
I tried to profile the loop and HOF approaches using timeit, as suggested by @DSM
def test1():
    s = """
nums = [a for a in range(1,1001)]
prod = 1
for n in nums:
    prod*=n
"""
    t = timeit.Timer(stmt=s)
    return t.repeat(repeat=10, number=100)

def test2():
    s = """
nums = [a for a in range(1,1001)]
prod2 = reduce(lambda x,y:x*y , nums)
"""
    t = timeit.Timer(stmt=s)
    return t.repeat(repeat=10, number=100)
And this is my result:
Loop:
[0.08340786340144211, 0.07211491653462579, 0.07162720686361926, 0.06593182661083438, 0.06399049758613146, 0.06605228229559557, 0.06419744588664211, 0.0671893658461038, 0.06477527090075941, 0.06418023793167627]
test1 average: 0.0644778902685
HOF:
[0.0759414223099324, 0.07616920129277016, 0.07570730355421262, 0.07604965128984942, 0.07547092059389193, 0.07544737286604364, 0.075532959799953, 0.0755039779810629, 0.07567424616704144, 0.07542563650187661]
test2 average: 0.0754917512762
On average, the loop approach seems to be faster than using HOFs.
Higher-order functions can be very fast.
For example, map(ord, somebigstring) is much faster than the equivalent list comprehension [ord(c) for c in somebigstring]. The former wins for three reasons:
map() pre-sizes the result list to the length of somebigstring. In contrast, the list comprehension must make many calls to realloc() as it grows.
map() only has to do one lookup for ord, first checking globals, then checking and finding it in builtins. The list comprehension has to repeat this work on every iteration.
The inner loop for map runs at C speed. The loop body for the list comprehension is a series of pure Python steps that each need to be dispatched or handled by the eval-loop.
Here are some timings to confirm the prediction:
>>> from timeit import Timer
>>> print min(Timer('map(ord, s)', 's="x"*10000').repeat(7, 1000))
0.808364152908
>>> print min(Timer('[ord(c) for c in s]', 's="x"*10000').repeat(7, 1000))
1.2946639061
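One thing worth adding for the product example specifically (my note, not part of the timings above): each call to the lambda is itself a pure-Python frame, so replacing it with operator.mul usually narrows the gap between reduce and the explicit loop:
from functools import reduce  # reduce is a builtin on Python 2, in functools on Python 3
import operator

nums = [a for a in range(1, 1001)]
prod = reduce(operator.mul, nums)  # no Python-level lambda call per element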
In my experience, loops can be very fast provided they are not nested too deeply and you avoid complex higher math operations. For simple operations with a single layer of loops, a plain loop can be as fast as any other approach, maybe faster, so long as only integers are used as the loop index; it really depends on what you are doing.
Also, it may very well be that the higher-order function ends up doing just as much looping as the explicit loop version, and might even be a little slower - you would have to time them both just to be sure.