I'm writing a toy rsync-like tool in Python. Like many similar tools, it will first use a very fast hash as the rolling hash, and then a SHA256 once a match has been found (but the latter is off topic here: SHA256, MD5, etc. are too slow as a rolling hash).
I'm currently testing various fast hash methods:
import os, random, time

block_size = 1024              # 1 KB blocks
total_size = 10*1024*1024      # 10 MB random bytes
s = os.urandom(total_size)

t0 = time.time()
for i in range(len(s)-block_size):
    h = hash(s[i:i+block_size])
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
I get: 0.8 MB/s ... so the Python built-in hash(...) function is too slow here.
Which solution would allow a faster hash of at least 10 MB/s on a standard machine?
I tried with
import zlib
...
h = zlib.adler32(s[i:i+block_size])
but it's not much better (1.1 MB/s)
I tried with sum(s[i:i+block_size]) % modulo and it's slow too
Interesting fact: even without any hash function, the loop itself is slow!
t0 = time.time()
for i in range(len(s)-block_size):
    s[i:i+block_size]
I get: 3.0 MB/s only! So the simple fact of having a loop accessing a rolling block of s is already slow.
Instead of reinventing the wheel by writing my own hash, or using custom Rabin-Karp algorithms, what would you suggest, first to speed up this loop, and then as a hash?
Edit: (Partial) solution for the "Interesting fact" slow loop above:
import os, random, time, zlib
from numba import jit

@jit()
def main(s):
    for i in range(len(s)-block_size):
        block = s[i:i+block_size]

total_size = 10*1024*1024      # 10 MB random bytes
block_size = 1024              # 1 KB blocks
s = os.urandom(total_size)
t0 = time.time()
main(s)
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
With Numba, there is a massive improvement: 40.0 MB/s, but still no hash done here. At least we're not blocked at 3 MB/s.
Instead of reinventing the wheel and write my own hash / or use custom
Rabin-Karp algorithms, what would you suggest, first to speed up this
loop, and then as a hash?
It's always great to start with this mentality, but it seems you didn't quite get the idea of rolling hashes.
What makes a hash function great for rolling is its ability to reuse the previous computation.
A few hash functions allow a rolling hash to be computed very
quickly—the new hash value is rapidly calculated given only the old
hash value, the old value removed from the window, and the new value
added to the window.
From the same wikipedia page
It's hard to compare performance across different machines without timeit, but I changed your script to use a simple polynomial hash with a prime modulo (it would be even faster with a Mersenne prime, because the modulo operation could then be done with binary operations):
import os, random, time

block_size = 1024              # 1 KB blocks
total_size = 10*1024*1024      # 10 MB random bytes
s = os.urandom(total_size)

base = 256
mod = int(1e9)+7               # a large prime

# note: in Python 3, indexing or iterating over a bytes object yields ints,
# so no ord() is needed on the individual bytes
def extend(previous_mod, byte):
    return ((previous_mod * base) + byte) % mod

most_significant = pow(base, block_size-1, mod)

def remove_left(previous_mod, byte):
    return (previous_mod - (most_significant * byte) % mod) % mod

def start_hash(data):
    h = 0
    for b in data:
        h = extend(h, b)
    return h

t0 = time.time()
h = start_hash(s[:block_size])
for i in range(block_size, len(s)):
    h = remove_left(h, s[i - block_size])
    h = extend(h, s[i])
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
Apparently you achieved quite an improvement with Numba, and it may speed up this code as well.
To extract more performance you may want to write a C (or another low-level language such as Rust) function that processes a big slice of the data at a time and returns an array with the hashes.
I'm writing an rsync-like tool as well, but as I'm writing it in Rust, performance at this level isn't a concern of mine. Instead, I'm following the tips of the creator of rsync and trying to parallelize everything I can, a painful task to do in Python (probably impossible without Jython).
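The Mersenne-prime speedup mentioned above can be sketched like this (the names are mine and this is an illustration, not a drop-in replacement for the code above): with mod = 2**61 - 1, the identity 2**61 ≡ 1 (mod p) lets you fold the high bits onto the low bits with shifts and masks instead of a division.

```python
MERSENNE = (1 << 61) - 1  # the Mersenne prime 2**61 - 1

def mod_mersenne(x):
    """x % MERSENNE via shifts/masks; valid for 0 <= x < MERSENNE**2."""
    # Fold high bits onto low bits: a*2**61 + b ≡ a + b (mod 2**61 - 1).
    x = (x & MERSENNE) + (x >> 61)
    x = (x & MERSENNE) + (x >> 61)
    # After two folds x is at most MERSENNE + 2, so one subtraction suffices.
    return x if x < MERSENNE else x - MERSENNE
```

In the polynomial hash above, every `% mod` would be replaced by a call like this (or inlined), with `base` and `most_significant` reduced modulo MERSENNE.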
what would you suggest, first to speed up this loop, and then as a hash?
Increase the block size. The smaller your block size, the more Python you'll be executing per byte, and the slower it will be.
edit: your range has the default step of 1 and you don't multiply i by block_size, so instead of iterating over 10*1024 non-overlapping blocks of 1 KB, you're iterating over 10 million minus 1024 mostly overlapping blocks.
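For contrast, here is a sketch of the non-overlapping variant of the question's loop (note this changes the semantics: a real rsync rolls the window one byte at a time on one side, but the other side only needs one hash per block):

```python
import os, time, zlib

block_size = 1024
total_size = 10 * 1024 * 1024
s = os.urandom(total_size)

t0 = time.time()
# step by block_size: 10*1024 blocks hashed once each, not ~10 million windows
hashes = [zlib.adler32(s[i:i + block_size])
          for i in range(0, len(s), block_size)]
dt = time.time() - t0
print('%d non-overlapping hashes in %.2f sec' % (len(hashes), dt))
```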
First, your slow loop. As has been mentioned, you are slicing a new block for every byte (less block size) in the stream. This is a lot of work for both CPU and memory.
A faster loop would be to pre-chunk the data into larger pieces and slice each block only once.
chunksize = 4096  # suggestion

# roll the window over the previous chunk's last block into the new chunk
lastblock = None
for readchunk in read_file_chunks(chunksize):
    for i in range(0, len(readchunk), blocksize):
        # slice a block only once
        newblock = readchunk[i:i+blocksize]
        if lastblock:
            for bi in range(len(newblock)):
                outbyte = lastblock[bi]
                inbyte = newblock[bi]
                # update rolling hash with inbyte and outbyte
                # check rolling hash for "hit"
        else:
            pass  # calculate initial weak hash, check for "hit"
        lastblock = newblock
Chunksize should be a multiple of blocksize
Next, you were calculating a "rolling hash" over the entirety of each block in turn, instead of updating the hash byte by byte in "rolling" fashion. That is immensely slower. The above loop forces you to deal with the bytes as they go in and out of the window. Still, my trials show pretty poor throughput (about 3 MiB/s) even with a modest number of arithmetic operations on each byte. Edit: I initially had a zip() and that appears rather slow; I got more than double the throughput for the loop alone without the zip (current code above).
Python is single-threaded and interpreted. I see one CPU pegged, and that is the bottleneck. To get faster you'll want multiple processes (subprocess) or to break into C, or both. Simply running the math in C would probably be enough, I think. (Haha, "simply".)
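One way to get the per-window math out of the Python loop without writing C is NumPy. As a sketch (my suggestion, under the assumption that the sum-based weak hash the question experimented with is acceptable): prefix sums give the sum of every overlapping window at once, with no Python-level per-byte loop.

```python
import os
import numpy as np

block_size = 1024
s = np.frombuffer(os.urandom(10 * 1024 * 1024), dtype=np.uint8)

# Prefix sums: c[i] is the sum of the first i bytes, so the sum of any
# window s[i:i+block_size] is c[i+block_size] - c[i], for all i at once.
c = np.concatenate(([0], np.cumsum(s, dtype=np.uint64)))
window_sums = c[block_size:] - c[:-block_size]
```

The same trick extends to other additive weak hashes (e.g. the two running sums of Adler-32 / the rsync checksum), though the modular wraparound needs a little extra care.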
This question already has answers here:
Time complexity of python set operations?
(3 answers)
Closed 3 years ago.
Say we add a group of long strings to a hashset, and then test if some string already exists in this hashset. Is the time complexity going to be constant for adding and retrieving operations? Or does it depend on the length of the strings?
For example, if we have three strings.
s1 = 'abcdefghijklmn'
s2 = 'dalkfdboijaskjd'
s3 = 'abcdefghijklmn'
Then we do:
pool = set()
pool.add(s1)
pool.add(s2)

print(s3 in pool)            # => True
print('zzzzzzzzzz' in pool)  # => False
Would time complexity of the above operations be a factor of the string length?
Another question is that what if we are hashing a tuple? Something like (1,2,3,4,5,6,7,8,9)?
I appreciate your help!
==================================
I understand that there are resources around, like this one, that talk about why hashing is constant time and about collision issues. However, they usually assume that the length of the key can be neglected. This question asks whether hashing still takes constant time when the key has a length that cannot be neglected. For example, if we are to test N times whether a key of length K is in the set, is the time complexity O(N) or O(N*K)?
One of the best ways to answer something like this is to dig into the implementation :)
Notwithstanding some of the optimization magic described in the header of setobject.c, adding an object into a set reuses the hash from strings where hash() has already been called once (recall, strings are immutable), or calls the type's hash implementation.
For Unicode/bytes objects, we end up via here in _Py_HashBytes, which has an optimization for small strings; otherwise it uses the compile-time configured hash function, all of which are naturally somewhat O(n)-ish. But again, this only happens once per string object.
For tuples, the hash implementation can be found here – apparently a simplified, non-cached xxHash.
However, once the hash has been computed, the time complexity for sets should be around O(1).
EDIT: A quick, not very scientific benchmark:
import time

def make_string(c, n):
    return c * n

def make_tuple(el, n):
    return (el,) * n

def hashtest(gen, n):
    # First compute how long generation alone takes
    gen_time = time.perf_counter()
    for x in range(n):
        gen()
    gen_time = time.perf_counter() - gen_time

    # Then compute how long hashing and generation together take
    hash_and_gen_time = time.perf_counter()
    for x in range(n):
        hash(gen())
    hash_and_gen_time = time.perf_counter() - hash_and_gen_time

    # Return the two
    return (hash_and_gen_time, gen_time)

for gen in (make_string, make_tuple):
    for obj_length in (10000, 20000, 40000):
        t = f"{gen.__name__} x {obj_length}"
        # Using `b'hello'.decode()` here to avoid any cached hash shenanigans
        hash_and_gen_time, gen_time = hashtest(
            lambda: gen(b"hello".decode(), obj_length), 10000
        )
        hash_time = hash_and_gen_time - gen_time
        print(t, hash_time, obj_length / hash_time)
outputs
make_string x 10000 0.23490356100000004 42570.66158311665
make_string x 20000 0.47143921999999994 42423.284172241765
make_string x 40000 0.942087403 42458.905482254915
make_tuple x 10000 0.45578034300000025 21940.393335480014
make_tuple x 20000 0.9328520900000008 21439.62608263008
make_tuple x 40000 1.8562772150000004 21548.505620158674
which basically says hashing sequences, be they strings or tuples, is linear time, yet hashing strings is a lot faster than hashing tuples.
EDIT 2: this proves strings and bytestrings cache their hashes:
import time
s = ('x' * 500_000_000)
t0 = time.perf_counter()
a = hash(s)
t1 = time.perf_counter()
print(t1 - t0)
t0 = time.perf_counter()
b = hash(s)
t2 = time.perf_counter()
assert a == b
print(t2 - t0)
outputs
0.26157095399999997
1.201999999977943e-06
Strictly speaking it depends on the implementation of the hash set and the way you're using it (there may be cleverness that will optimize some of the time away in specialized circumstances), but in general, yes, you should expect that it will take O(n) time to hash a key to do an insert or lookup where n is the size of the key. Usually hash sets are assumed to be O(1), but there's an implicit assumption there that the keys are of fixed size and that hashing them is a O(1) operation (in other words, there's an assumption that the key size is negligible compared to the number of items in the set).
Optimizing the storage and retrieval of really big chunks of data is why databases are a thing. :)
Average case is O(1).
However, the worst case is O(n), with n being the number of elements in the set. This case is caused by hashing collisions.
you can read more about it in here
https://www.geeksforgeeks.org/internal-working-of-set-in-python/
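The collision worst case is easy to provoke deliberately with a type whose hash is constant (an illustrative sketch of mine, not from the linked article): every element lands in the same bucket, so membership tests degrade to a linear scan over the colliding chain.

```python
class BadHash(object):
    """All instances hash alike, forcing every set operation to collide."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 0  # pathological: one bucket for everything
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

pool = {BadHash(i) for i in range(1000)}
# each `in` test now compares against (up to) all 1000 colliding elements
```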
Wiki is your friend
https://wiki.python.org/moin/TimeComplexity
For the operations above, it seems they are all O(1) for a set.
What is the fastest way to compute e^x, given that x can be a floating-point value?
Right now I have used Python's math library to compute this. Below is the complete code, where result = -0.490631 + 0.774275 * math.exp(0.474907 * sum) is the main logic; the rest is file-handling code which the question demands.
import math
import sys

def sum_digits(n):
    r = 0
    while n:
        r, n = r + n % 10, n // 10
    return r

def _print(string):
    fo = open("output.txt", "w+")
    fo.write(string)
    fo.close()

try:
    f = open('input.txt')
except IOError:
    _print("error")
    sys.exit()

data = f.read()
num = data.split('\n', 1)[0]

try:
    val = int(num)
except ValueError:
    _print("error")
    sys.exit()

sum = sum_digits(int(num))  # note: shadows the built-in sum()
f.close()

if (sum == 2):
    _print("1")
else:
    result = -0.490631 + 0.774275 * math.exp(0.474907 * sum)
    _print(str(math.ceil(result)))
The right-hand side of result is the equation of a curve (which is the solution to a programming problem) that I derived from Wolfram Mathematica using my own data set.
But this doesn't seem to pass the par criteria of the assessment !
I have also tried the Newton-Raphson way, but convergence for larger x is causing problems; other than that, calculating the natural log ln(x) is a challenge there again!
I don't have any language constraint, so any solution is acceptable. Also, if Python's math library is the fastest, as some of the comments say, can anyone give an insight into the time complexity and execution time of this program, in short the efficiency of the program?
I don't know if the exponential curve math is accurate in this code, but it certainly isn't the slow point.
First, you read the input data in one read call. It does have to be read, but that loads the entire file. The next step takes the first line only, so it would seem more appropriate to use readline. That split itself is O(n) where n is the file size, at least, which might include data you were ignoring since you only process one line.
Second, you convert that line into an int. This probably requires Python's long integer support, but the operation could be O(n) or O(n^2). A single pass algorithm would multiply the accumulated number by 10 for each digit, allocating one or two new (longer) longs each time.
Third, sum_digits breaks that long int down into digits again. It does so using division, which is expensive, and two operations as well, rather than using divmod. That's O(n^2), because each division has to process every higher digit for each digit. And it's only needed because of the conversion you just did.
Summing the digits found in a string is likely easier done with something like sum(int(c) for c in l if c.isdigit()) where l is the input line. It's not particularly fast, as there's quite a bit of overhead in the digit conversions and the sum might grow large, but it does make a single pass with a fairly tight loop; it's somewhere between O(n) and O(n log n), depending on the length of the data, because the sum might grow large itself.
As for the unknown exponential curve, the existence of an exception for a low number is concerning. There's likely some other option that's both faster and more accurate if the answer's an integer anyway.
Lastly, you have at least four distinct output data formats: error, 2, 3.0, 3e+20. Do you know which of these is expected? Perhaps you should be using formatted output rather than str to convert your numbers.
One extra note: if the data is really large, processing it in chunks will definitely speed things up (instead of running out of memory, needing to swap, etc.). As you're looking for a digit sum, your space complexity can be reduced from O(n) to O(log n).
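A sketch of that chunked approach (digit_sum_chunked is my name; like the original code, it assumes the number is on the first line of the file):

```python
def digit_sum_chunked(path, chunk_size=64 * 1024):
    """Sum the decimal digits of the first line without loading the whole file."""
    total = 0
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # end of file
                return total
            for ch in chunk:
                if ch == '\n':     # only the first line matters
                    return total
                if ch.isdigit():
                    total += int(ch)
```

Memory use is now bounded by the chunk size plus the (logarithmically small) running total, regardless of how long the number is.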
I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
from numpy import array, inf

def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
            except:
                M[i,j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts to cutting the time down. However, as I need to evaluate potentially the entire matrix I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10

/* Sketch only: assumes a min() macro or function is available, and that
   costout is initially filled with costin's values. Indices start at 1
   so that i-1/j-1 never read outside the arrays. */
void optimalcost(double **costin, double **costout){
    int i, j;
    double a, b, c, prevcost = 0.0;
    for(i = 1; i < 400; i++){
        for(j = 1; j < 400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += min(a, min(b, c));
            prevcost = costout[i][j];
        }
    }
}
Update:
I'm on Mac, and I don't want to install a whole new Python toolchain so I used Homebrew.
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height, width = left.shape
    cost = np.zeros((height, width, width))
    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row, x, y] = abs(left[row, x] - right[row, y])
    return cost

@autojit
def optimalcosts(initcost):
    costs = np.zeros_like(initcost)
    for row in range(len(initcost)):
        costs[row, :, :] = optimalcost(initcost[row])
    return costs

@autojit
def optimalcost(cost):
    width1, width2 = cost.shape
    P1 = 10
    prevcost = 0.0
    M = np.array(cost)
    for i in range(1, width1):
        for j in range(1, width2):
            M[i,j] += min(M[i-1,j-1], prevcost+P1, M[i,j-1]+P1)
            prevcost = M[i,j]
    return M

prob_size = 400
left = np.random.rand(prob_size, prob_size)
right = np.random.rand(prob_size, prob_size)

print '---------- Numba Time ----------'
t = time.time()
c = cost(left, right)
optimalcost(c[100])
print time.time()-t

print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left, right)
optimalcost.py_func(c[100])
print time.time()-t
It's interesting writing code in Python that is so un-Pythonic. Note for anyone interested in writing Numba code, you need to explicitly express loops in your code. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest path (or lowest-energy path) above, you can vectorize the problem by thinking about what can be computed at the same time (and try to avoid making copies):
Suppose we are finding a shortest path in the "row" direction (i.e. horizontally), we can first create our algorithm input:
import numpy as np

# The problem: 300 400*400 matrices
# Create an infinitely high boundary so that we don't need to handle indexing "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and timeit:
%%timeit
for i in np.arange(a.shape[1]-1):
    A[i].min(2, T)
    B[i] += T
The timing result on my (super old laptop) machine is 1.78 s, which is already way faster than 3 minutes. I believe you can improve even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis).
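A sketch of the multiprocessing.Pool route (the function names are mine): each worker runs the question's per-matrix recurrence sequentially, and the batch axis is split across processes, since the 300 matrices are independent.

```python
import numpy as np
from multiprocessing import Pool

P1 = 10

def optimalcost_one(cost):
    # The question's recurrence, applied sequentially to a single 2-D matrix.
    M = cost.copy()
    for i in range(1, M.shape[0]):
        for j in range(1, M.shape[1]):
            M[i, j] += min(M[i-1, j-1], M[i-1, j] + P1, M[i, j-1] + P1)
    return M

def solve_batch(matrices, processes=4):
    # Split on the batch axis: one whole matrix per task.
    with Pool(processes) as pool:
        return pool.map(optimalcost_one, matrices)
```

Because each task is a whole matrix, the inter-process pickling overhead is amortized over 160,000 inner-loop steps per task.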
I've been immensely frustrated with many of the implementations of python radix sort out there on the web.
They consistently use a radix of 10 and get the digits of the numbers they iterate over by dividing by a power of 10 or taking the log10 of the number. This is incredibly inefficient, as log10 is not a particularly quick operation compared to bit shifting, which is nearly 100 times faster!
A much more efficient implementation uses a radix of 256 and sorts the number byte by byte. This allows for all of the 'byte getting' to be done using the ridiculously quick bit operators. Unfortunately, it seems that absolutely nobody out there has implemented a radix sort in python that uses bit operators instead of logarithms.
So, I took matters into my own hands and came up with this beast, which runs at about half the speed of sorted on small arrays and runs nearly as quickly on larger ones (e.g. len around 10,000,000):
import itertools

def radix_sort(unsorted):
    "Fast implementation of radix sort for any size num."
    maximum, minimum = max(unsorted), min(unsorted)

    max_bits = maximum.bit_length()
    highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1

    min_bits = minimum.bit_length()
    lowest_byte = min_bits // 8 if min_bits % 8 == 0 else (min_bits // 8) + 1

    sorted_list = unsorted
    for offset in xrange(lowest_byte, highest_byte):
        sorted_list = radix_sort_offset(sorted_list, offset)

    return sorted_list

def radix_sort_offset(unsorted, offset):
    "Helper function for radix sort, sorts each offset."
    byte_check = (0xFF << offset*8)

    buckets = [[] for _ in xrange(256)]

    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)

    return list(itertools.chain.from_iterable(buckets))
This version of radix sort works by finding which bytes it has to sort by (if you pass it only integers below 256, it'll sort just one byte, etc.) then sorting each byte from LSB up by dumping them into buckets in order then just chaining the buckets together. Repeat this for each byte that needs to be sorted and you have your nice sorted array in O(n) time.
However, it's not as fast as it could be, and I'd like to make it faster before I write about it as a better radix sort than all the other radix sorts out there.
Running cProfile on this tells me that a lot of time is being spent on the append method for lists, which makes me think that this block:
for num in unsorted:
    byte_at_offset = (num & byte_check) >> offset*8
    buckets[byte_at_offset].append(num)
in radix_sort_offset is eating a lot of time. This is also the block that, if you really look at it, does 90% of the work for the whole sort. This code looks like it could be numpy-ized, which I think would result in quite a performance boost. Unfortunately, I'm not very good with numpy's more complex features so haven't been able to figure that out. Help would be very appreciated.
I'm currently using itertools.chain.from_iterable to flatten the buckets, but if anyone has a faster suggestion I'm sure it would help as well.
Originally, I had a get_byte function that returned the nth byte of a number, but inlining the code gave me a huge speed boost, so I did that.
Any other comments on the implementation or ways to squeeze out more performance are also appreciated. I want to hear anything and everything you've got.
You already realized that
for num in unsorted:
    byte_at_offset = (num & byte_check) >> offset*8
    buckets[byte_at_offset].append(num)
is where most of the time goes - good ;-)
There are two standard tricks for speeding that kind of thing, both having to do with moving invariants out of loops:
Compute "offset*8" outside the loop. Store it in a local variable. Save a multiplication per iteration.
Add bucketappender = [bucket.append for bucket in buckets] outside the loop. Saves a method lookup per iteration.
Combine them, and the loop looks like:
for num in unsorted:
    bucketappender[(num & byte_check) >> ofs8](num)

Collapsing it to one statement also saves a pair of local variable store/fetch opcodes per iteration.
But, at a higher level, the standard way to speed radix sort is to use a larger radix. What's magical about 256? Nothing, apart from that it's convenient for bit-shifting. But so are 512, 1024, 2048 ... it's a classical time/space tradeoff.
PS: for very long numbers,
(num >> offset*8) & 0xff
will run faster. That's because your num & byte_check takes time proportional to log(num) - it generally has to create an integer about as big as num.
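On the numpy-izing the question asked about, here is one sketch (radix_sort_np is my name, not a drop-in replacement for the code above): a stable argsort on each byte plays the role of the bucketing pass, so all the per-element work happens inside numpy.

```python
import numpy as np

def radix_sort_np(a, radix_bits=8):
    """LSB-first radix sort of non-negative integers; each pass is a
    stable argsort on one digit, standing in for the bucket lists."""
    a = np.asarray(a)
    mask = (1 << radix_bits) - 1
    shift = 0
    while (a >> shift).any():      # stop once every remaining digit is zero
        digit = (a >> shift) & mask
        a = a[np.argsort(digit, kind='stable')]
        shift += radix_bits
    return a
```

This also makes the larger-radix experiment a one-argument change (radix_bits=11 gives radix 2048, say), at the cost of an O(n log n) argsort per pass rather than a true O(n) counting pass.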
This is an old thread, but I came across this when looking to radix sort an array of positive integers. I was trying to see if I can do any better than the already wickedly fast timsort (hats off to you again, Tim Peters) which implements python's builtin sorted and sort! Either I don't understand certain aspects of the above code, or if I do, the code as presented above has some problems IMHO.
It only sorts bytes starting with the highest byte of the smallest item and ending with the highest byte of the biggest item. This may be okay in some cases of special data. But in general the approach fails to differentiate items which differ on account of the lower bits. For example:
arr=[65535,65534]
radix_sort(arr)
produces the wrong output:
[65535, 65534]
The range used to loop over the helper function is not correct. What I mean is that if lowest_byte and highest_byte are the same, execution of the helper function is skipped altogether. BTW I had to change xrange to range in two places.
With modifications to address the above 2 points, I got it to work. But it is taking 10-20 times the time of python's builtin sorted or sort! I know timsort is very efficient and takes advantage of already sorted runs in the data. But I was trying to see if I can use the prior knowledge that my data is all positive integers to some advantage in my sorting. Why is the radix sort doing so badly compared to timsort? The array sizes I was using are in the order of 80K items. Is it because the timsort implementation in addition to its algorithmic efficiency has also other efficiencies stemming from possible use of low level libraries? Or am I missing something entirely? The modified code I used is below:
import itertools

def radix_sort(unsorted):
    "Fast implementation of radix sort for any size num."
    maximum, minimum = max(unsorted), min(unsorted)

    max_bits = maximum.bit_length()
    highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1
    # min_bits = minimum.bit_length()
    # lowest_byte = min_bits // 8 if min_bits % 8 == 0 else (min_bits // 8) + 1

    sorted_list = unsorted
    # xrange changed to range, lowest_byte deleted from the arguments
    for offset in range(highest_byte):
        sorted_list = radix_sort_offset(sorted_list, offset)

    return sorted_list

def radix_sort_offset(unsorted, offset):
    "Helper function for radix sort, sorts each offset."
    byte_check = (0xFF << offset*8)

    # xrange changed to range
    buckets = [[] for _ in range(256)]

    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)

    return list(itertools.chain.from_iterable(buckets))
You could simply use one of the existing C or C++ implementations, for example integer_sort from Boost.Sort or u4_sort from usort. It is surprisingly easy to call native C or C++ code from Python; see How to sort an array of integers faster than quicksort?
I totally get your frustration. Although it's been more than 2 years, numpy still does not have radix sort. I will let the NumPy developers know that they could simply grab one of the existing implementations; licensing should not be an issue.
I have implemented a naive merge sorting algorithm in Python. Algorithm and test code is below:
import time
import random
import matplotlib.pyplot as plt
import math
from collections import deque

def sort(unsorted):
    if len(unsorted) <= 1:
        return unsorted
    to_merge = deque(deque([elem]) for elem in unsorted)
    while len(to_merge) > 1:
        left = to_merge.popleft()
        right = to_merge.popleft()
        to_merge.append(merge(left, right))
    return to_merge.pop()

def merge(left, right):
    result = deque()
    while left or right:
        if left and right:
            elem = left.popleft() if left[0] > right[0] else right.popleft()
        elif not left and right:
            elem = right.popleft()
        elif not right and left:
            elem = left.popleft()
        result.append(elem)
    return result
LOOP_COUNT = 100
START_N = 1
END_N = 1000

def test(fun, test_data):
    start = time.clock()
    for _ in xrange(LOOP_COUNT):
        fun(test_data)
    return time.clock() - start

def run_test():
    timings, elem_nums = [], []
    test_data = random.sample(xrange(100000), END_N)
    for i in xrange(START_N, END_N):
        loop_test_data = test_data[:i]
        elapsed = test(sort, loop_test_data)
        timings.append(elapsed)
        elem_nums.append(len(loop_test_data))
        print "%f s --- %d elems" % (elapsed, len(loop_test_data))
    plt.plot(elem_nums, timings)
    plt.show()

run_test()
As much as I can see everything is OK and I should get a nice N*logN curve as a result. But the picture differs a bit:
Things I've tried to investigate the issue:
PyPy. The curve is ok.
Disabled the GC using the gc module. Wrong guess. Debug output showed that it doesn't even run until the end of the test.
Memory profiling using meliae - nothing special or suspicious.
I had another implementation (a recursive one using the same merge function), and it behaves in a similar way. The more full test cycles I run, the more "jumps" there are in the curve.
So how can this behaviour be explained and - hopefully - fixed?
UPD: changed lists to collections.deque
UPD2: added the full test code
UPD3: I use Python 2.7.1 on Ubuntu 11.04, on a quad-core 2 GHz notebook. I tried to turn off most other processes: the number of spikes went down, but at least one of them was still there.
You are simply picking up the impact of other processes on your machine.
You run your sort function 100 times for input size 1 and record the total time spent on this. Then you run it 100 times for input size 2, and record the total time spent. You continue doing so until you reach input size 1000.
Let's say once in a while your OS (or you yourself) start doing something CPU-intensive. Let's say this "spike" lasts as long as it takes you to run your sort function 5000 times. This means that the execution times would look slow for 5000 / 100 = 50 consecutive input sizes. A while later, another spike happens, and another range of input sizes look slow. This is precisely what you see in your chart.
I can think of one way to avoid this problem. Run your sort function just once for each input size: 1, 2, 3, ..., 1000. Repeat this process 100 times, using the same 1000 inputs (it's important, see explanation at the end). Now take the minimum time spent for each input size as your final data point for the chart.
That way, your spikes should only affect each input size only a few times out of 100 runs; and since you're taking the minimum, they will likely have no impact on the final chart at all.
If your spikes are really really long and frequent, you of course might want to increase the number of repetitions beyond the current 100 per input size.
Looking at your spikes, I notice the execution slows down exactly 3 times during a spike. I'm guessing the OS gives your python process one slot out of three during high load. Whether my guess is correct or not, the approach I recommend should resolve the issue.
EDIT:
I realized that I didn't clarify one point in my proposed solution to your problem.
Should you use the same input in each of your 100 runs for the given input size? Or should use 100 different (random) inputs?
Since I recommended taking the minimum of the execution times, the inputs should be the same (otherwise you'll be getting incorrect output, as you'll be measuring the best-case algorithm complexity instead of the average complexity!).
But when you take the same inputs, you create some noise in your chart since some inputs are simply faster than others.
So a better solution is to resolve the system load problem, without creating the problem of only one input per input size (this is obviously pseudocode):
seed = 'choose whatever you like'
repeats = 4
inputs_per_size = 25
runtimes = defaultdict(lambda: float('inf'))

for r in range(repeats):
    random.seed(seed)
    for i in range(inputs_per_size):
        for n in range(1000):
            input = generate_random_input(size=n)
            execution_time = get_execution_time(input)
            if runtimes[(n, i)] > execution_time:
                runtimes[(n, i)] = execution_time

for n in range(1000):
    runtimes[n] = sum(runtimes[(n, i)] for i in range(inputs_per_size)) / inputs_per_size
Of course, depending if your system is super-noisy, you might change (repeats, inputs_per_size) from (4,25) to say, (10,10), or even (25,4).
I can reproduce the spikes using your code:
You should choose an appropriate timing function (time.time() vs. time.clock() -- from timeit import default_timer), the number of repetitions in a test (how long each test takes), and the number of tests to choose the minimal time from. That gives you better precision and less external influence on the results. Read the note from the timeit.Timer.repeat() docs:
It’s tempting to calculate mean and standard deviation from the result
vector and report these. However, this is not very useful. In a
typical case, the lowest value gives a lower bound for how fast your
machine can run the given code snippet; higher values in the result
vector are typically not caused by variability in Python’s speed, but
by other processes interfering with your timing accuracy. So the min()
of the result is probably the only number you should be interested in.
After that, you should look at the entire vector and apply common
sense rather than statistics.
The timeit module can choose appropriate parameters for you:
$ python -mtimeit -s 'from m import testdata, sort; a = testdata[:500]' 'sort(a)'
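The same measurement can be driven from Python; a minimal sketch using timeit.repeat (best_time is my name, and taking min() follows the docs quoted above; the module m from the command line above is only shown in the comment):

```python
import timeit

def best_time(stmt, setup='pass', repeat=5, number=100):
    """Run `stmt` `number` times per trial, `repeat` trials; keep the best."""
    return min(timeit.repeat(stmt, setup=setup, repeat=repeat, number=number))

# e.g. best_time('sort(a)', 'from m import testdata, sort; a = testdata[:500]')
```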
Here's timeit-based performance curve:
The figure shows that sort() behavior is consistent with O(n*log(n)):
|------------------------------+-------------------|
| Fitting polynom | Function |
|------------------------------+-------------------|
| 1.00 log2(N) + 1.25e-015 | N |
| 2.00 log2(N) + 5.31e-018 | N*N |
| 1.19 log2(N) + 1.116 | N*log2(N) |
| 1.37 log2(N) + 2.232 | N*log2(N)*log2(N) |
To generate the figure I've used make-figures.py:
$ python make-figures.py --nsublists 1 --maxn=0x100000 -s vkazanov.msort -s vkazanov.msort_builtin
where:
# adapt sorting functions for make-figures.py
def msort(lists):
    assert len(lists) == 1
    return sort(lists[0])       # `sort()` from the question

def msort_builtin(lists):
    assert len(lists) == 1
    return sorted(lists[0])     # builtin
Input lists are described here (note: the input is sorted so builtin sorted() function shows expected O(N) performance).