I have recently learned Karatsuba multiplication. In order to fully understand the concept, I have attempted to write the code in Python and compared its running time against classical multiplication. Although the results are equal, the Karatsuba version is still the slower one, which I suspect is down to the recursive calls. What's wrong with my approach? Any help would definitely allow me to understand more about algorithm design.
Best
JP
print('Karatsuba multiplication in Python')
x=raw_input("first_number=")
y=raw_input("second_number=")
print('------------------------')
x=int(x)
y=int(y)
import math
import time
def karatsuba(x,y):
    x=str(x)
    y=str(y)
    len_x=len(x)
    len_y=len(y)
    if(int(len_x)==1 or int(len_y)==1):
        return int(x)*int(y)
    else:
        B=10
        exp1=int(math.ceil(len_x/2.0))
        exp2=int(math.ceil(len_y/2.0))
        if(exp1<exp2):
            exp=exp1
        else:
            exp=exp2
        m1=len_x-exp
        m2=len_y-exp
        a=karatsuba(int(x[0:m1]),int(y[0:m2]))
        c=karatsuba(int(x[m1:len_x]),int(y[m2:len_y]))
        b=karatsuba(int(x[0:m1])+int(x[m1:len_x]),int(y[0:m2])+int(y[m2:len_y]))-a-c
        results=a*math.pow(10,2*exp) + b*math.pow(10,exp) + c
        return int(results)
start_time=time.time()
ctrl = x*y
tpt=time.time() - start_time
print x,'*',y,'=',ctrl
print("--- %s seconds ---" % tpt)
start_time=time.time()
output=karatsuba(x,y)
tpt=time.time() - start_time
print 'karatsuba(',x,',',y,')=',output
print("--- %s seconds ---" % tpt)
Karatsuba multiplication has a bigger overhead than classical binary multiplication
Its asymptotic complexity is better, but the overhead means Karatsuba only becomes faster for bigger numbers. The better Karatsuba is coded, the smaller the threshold operand size.
I see in your code that you convert the number to a string to get the digit count
That is a very slow operation, especially for big numbers; use logarithms (the binary-to-decimal digit ratio is constant) and the binary bit count instead. Look here for ideas on how to code Karatsuba faster (the code is in C++).
usage of pow
Another slowdown; use a table of powers of 10 instead.
what are you comparing it to? (originally asked by Padraic Cunningham)
Karatsuba is faster because it does operations on lower bit-count variables! I do not code in Python at all, so I could be missing something (like arbitrary-precision ints), but I do not see anywhere that you lower the data type along with the bit count, so you will always be slower! It would also be nice to add the slow multiplication you are comparing the times against, like binary or radix multiplication (add whatever you use). If you just use the * operator and you are on some bigint lib, then it is possible you are comparing Karatsuba with Karatsuba, or even with Schönhage-Strassen.
time measurement
How do you measure time? The measured times should be bigger than 10 ms; if they are not, loop the computation N times and measure the whole thing to avoid accuracy problems. Also keep in mind the scheduling granularity of the OS; look here if you have no idea what I am writing about.
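Pulling those points together, here is a rough sketch (my own illustration, not code from the question) of an integer-only Karatsuba that splits with bit shifts instead of string slicing and math.pow, plus a timing loop repeated enough times to be measurable. Note that CPython's built-in * already switches to Karatsuba for big ints, so a pure-Python version like this is still expected to lose to it; the threshold and test number below are arbitrary:
import timeit

def karatsuba_int(x, y):
    """Karatsuba on non-negative ints, splitting at a power of two so the
    size check uses bit_length() instead of a string conversion."""
    if x.bit_length() <= 64 or y.bit_length() <= 64:   # illustrative threshold
        return x * y
    m = min(x.bit_length(), y.bit_length()) // 2
    mask = (1 << m) - 1
    xh, xl = x >> m, x & mask
    yh, yl = y >> m, y & mask
    a = karatsuba_int(xh, yh)
    c = karatsuba_int(xl, yl)
    b = karatsuba_int(xh + xl, yh + yl) - a - c
    return (a << 2 * m) + (b << m) + c

x = y = 3 ** 5000                    # an arbitrary large test number
assert karatsuba_int(x, y) == x * y
# loop the measurement so each sample is well above the OS timer granularity
print(timeit.timeit(lambda: karatsuba_int(x, y), number=100))
print(timeit.timeit(lambda: x * y, number=100))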
Your algorithm should just multiply the numbers directly while they are still small, e.g. when the digit count is < 10:
if int(len_x) < 10 or int(len_y) < 10:
karatsuba1 is your original code; karatsuba uses if int(len_x) < 10 or int(len_y) < 10.
In [17]: %timeit karatsuba1(999,999)
100000 loops, best of 3: 13.3 µs per loop
In [18]: %timeit karatsuba(999,999)
1000000 loops, best of 3: 1.77 µs per loop
I am trying to find a very fast way to find the next higher power of 2 for a very large number (1,000,000 digits). For example, given 1009, I want its next higher power of two, which is 1024, or 2**10.
I tried using a loop, but for large numbers this is very, very slow:
y=0
while (1<<y)<1009:
    y+=1
print(1<<y)
1024
While this works, it's slow for numbers with a million digits or more. Is there a faster algorithm to find the next higher power of 2 for a number that large?
ANSWERED BY #JonClements
using 2**number.bit_length() works perfectly. So this will work for large numbers as well. Thanks Jon.
Here's a code example from Jon's implementation (with j = 1009 from the question):
j = 1009
2**j.bit_length()
1024
Here's a code example using the shift operator:
2<<(j.bit_length()-1)
1024
Here is the time difference using the million-digit number; the shift-operator version with bit_length is significantly faster:
len(str(aa))
1000000
def useBITLENGTHwithshiftoperator(hm):
    return 1<<hm.bit_length()-1<<1

def useBITLENGTHwithpowersoperator(hm):
    return 2**hm.bit_length()
start = time.time()
l=useBITLENGTHwithpowersoperator(aa)
end = time.time()
print(end - start)
0.014303922653198242
start = time.time()
l=useBITLENGTHwithshiftoperator(aa)
end = time.time()
print(end - start)
0.0002968311309814453
Take 2^ceiling(log2(x)); that should work unless x is already a power of 2, which you can check with log2(x) == ceiling(log2(x)).
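A minimal sketch of that comment in Python (math.log2 has to convert to float, so this only works for numbers small enough to fit in a float; a million-digit int would raise OverflowError):
import math

x = 1009
e = math.ceil(math.log2(x))   # smallest integer exponent with 2**e >= x
print(2 ** e)                 # 1024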
I do not code in Python, but millions of digits implies bignums, so:
try to look inside your bignum lib
It might return the number of words or bits used in O(1), as some number representations need that to speed up other things. In that case you can obtain your answer in O(1) for free.
As #JonClements suggested in a comment, try bit_length() and measure whether it is O(1) or O(log(n))...
Your while loop is O(n^3) instead of O(n^2)
You are bit-shifting starting from 1 over and over again in each iteration. Why not just shift the last result by 1 bit instead? Something like
for (y=0,yy=1;yy<1009;y++,yy<<=1);
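In Python, the same idea could look like this (a minimal sketch; yy carries the running power of two so it never has to be rebuilt from 1):
y, yy = 0, 1
while yy < 1009:
    y += 1
    yy <<= 1      # shift the previous result instead of recomputing 1 << y
print(yy)         # 1024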
using log2 might be faster
In case the bignum class you use has it implemented properly, above some number-size threshold log2(1009) might be significantly faster. But that depends on the type of numbers you are using and on the bignum implementation itself.
bit-shifting can be even faster
If you have some upper limit on your numbers, you can use binary search, converting your bit-shifting into O(n.log2(n)).
If not, you can start by bit-shifting by 32 bits instead of by 1, and once you reach the target size, bit-shift by 1 bit. Or even use more layers, like 1024/128/16/1 bits. The complexity would still be O(n^2), but the constant would be ~1024 times smaller, speeding up your code ~1024 times for big numbers...
Another option is to find the limit by shifting by 1 bit, then by 2, then by 4, 8, 16, 32, 64, ... until the result is bigger than your target number, and from there either bit-shift back or use binary search. This one would be O(n.log2(n)) even without any upper limit.
However, all of these bring much higher overhead and will slow down the processing of smaller numbers.
Constructing 2^(y-1) < x <= 2^y might be possible to enhance too. For example, by using the bit-shifting approach to find y, you get your answer as a byproduct for free. With floating-point or fixed-point numbers you can directly construct such a number by setting the exponent of 1.0, or by setting the correct bit in zero... But for arbitrary numbers (where the size of the number is dynamic) this is much harder/slower. So it all boils down to what kind of bignum class you have and what values you use.
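For illustration, a rough Python sketch of the "shift by growing steps, then refine with binary search" idea from the points above; the function name and the exact doubling scheme are mine, and for Python ints bit_length() remains the simpler route:
def next_pow2(x):
    """Smallest power of two >= x, via exponential search then binary search."""
    if x <= 1:
        return 1
    hi = 1
    while (1 << hi) < x:      # grow the candidate exponent 1, 2, 4, 8, ...
        hi <<= 1
    lo = hi >> 1
    while lo + 1 < hi:        # binary search for the smallest exponent that works
        mid = (lo + hi) // 2
        if (1 << mid) < x:
            lo = mid
        else:
            hi = mid
    return 1 << hi

print(next_pow2(1009))        # 1024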
I am a bit confused on this question and would appreciate some guidance on it:
An O(n^2) function takes approx 1 second to run when N is 10000.
How long will it take when N is 30000?
I was thinking that it would either be 1 second as well or 3 seconds since it is three times the size, but I am not sure if my logic is correct.
Thank you.
From Wikipedia:
In computer science, the time complexity of an algorithm quantifies the amount of time taken by an algorithm to run as a function of the length of the string representing the input.
So, if the complexity is O(n^2) and the input is 3 times greater, then the running time is 3^2 = 9 times greater. The running time is 9 seconds.
There are many problems with the question.
First problem: time complexity does not, in general, measure time in seconds. For example, the time complexity of a sorting algorithm might refer to the number of comparisons (or swaps), and the time complexity of a hash table lookup might also refer to the number of comparisons performed. It's debatable whether the actual runtime is proportional to these measurements.
Second problem: the definition of big-O is this:
f(n) = O(g(n)) if there's N and k such that n > N implies f(n) < k*g(n).
That's a problem because even if the runtime in this case is measured in seconds, applying the definition to O(n^2) says only that for large enough n the function is bounded above by some multiple of n^2.
So there's no guarantee that 10000 and 30000 are big enough to qualify for "big enough", and even if they were, you can't begin to estimate k from a single data point. And even with that estimate, you only get an upper bound.
What the question probably meant to ask was this:
Suppose that a function runs in time approximately proportional to n^2, and it takes 1 second when n=10000. Approximately how long does it take when n=30000?
Then, one can solve the equations:
1 sec ~= k * 10000^2
answer ~= k * 30000^2
= 3^2 * k * 10000^2
~= 3^2 * 1 sec
= 9 sec
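The same arithmetic as a quick Python check (numbers taken from the question):
t1, n1, n2 = 1.0, 10000, 30000
k = t1 / n1 ** 2      # estimate the proportionality constant from the known data point
t2 = k * n2 ** 2      # predicted time for the larger input
print(t2)             # 9.0 seconds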
I am trying to write some nested loops in my algorithm, and I am running into the problem that the whole algorithm takes too long because of these nested loops. I am quite new to Python (as you may see from my unprofessional code below :( ) and hopefully someone can point me to a way to speed up my code!
The whole algorithm is for fire detection in multiple 1500*6400 arrays. A small contextual analysis is applied while going through the whole array. The contextual analysis is performed with a dynamically assigned window size: the window can grow from 11*11 up to 31*31 until there are enough valid values inside the sampling window for the next round of calculation, for example like below:
def ContextualWindows (arrb4,arrb5,pfire):
    ####arrb4,arrb5,pfire are 31*31 sampling windows from large 1500*6400 numpy array
    i=5
    while i in range (5,16):
        arrb4back=arrb4[15-i:16+i,15-i:16+i]
        ## only output the array data when it is 'large' enough
        ## to have enough good quality data to do calculation
        if np.ma.count(arrb4back)>=min(10,0.25*i*i):
            arrb5back=arrb5[15-i:16+i,15-i:16+i]
            pfireback=pfire[15-i:16+i,15-i:16+i]
            canfire=0
            i=20
        else:
            i=i+1
    ###unknown pixel: background condition could not be characterized
    if i!=20:
        canfire=1
        arrb5back=arrb5
        pfireback=pfire
        arrb4back=arrb4
    return (arrb4back,arrb5back,pfireback,canfire)
This dynamic window is then fed into the next round of tests, for example:
b4backave=np.mean(arrb4Windows)
b4backdev=np.std(arrb4Windows)
if b4>b4backave+3.5*b4backdev:
    firetest=True
Running the whole code over my multiple 1500*6400 numpy arrays took over half an hour, or even longer. I'm just wondering if anyone has an idea how to deal with it? A general idea of which part I should put my effort into would be greatly helpful!
Many thanks!
Avoid while loops if speed is a concern. The loop lends itself to a for loop as start and end are fixed. Additionally, your code does a lot of copying which isn't really necessary. The rewritten function:
def ContextualWindows (arrb4, arrb5, pfire):
    ''' arrb4,arrb5,pfire are 31*31 sampling windows from
        large 1500*6400 numpy array '''
    for i in range(5, 16):
        lo = 15 - i   # 10..0
        hi = 16 + i   # 21..31
        # only output the array data when it is 'large' enough
        # to have enough good quality data to do calculation
        if np.ma.count(arrb4[lo:hi, lo:hi]) >= min(10, 0.25*i*i):
            return (arrb4[lo:hi, lo:hi], arrb5[lo:hi, lo:hi], pfire[lo:hi, lo:hi], 0)
    else:  # unknown pixel: background condition could not be characterized
        return (arrb4, arrb5, pfire, 1)
For clarity I've used style guidelines from PEP 8 (like extended comments, number of comment chars, spaces around operators etc.). Copying of a windowed arrb4 occurs twice here, but only if the condition is fulfilled, and that will happen only once per function call. The else clause will be executed only if the for-loop has run to its end. We don't even need a break from the loop as we exit the function altogether.
Let us know if that speeds up the code a bit. I don't think it'll be much but then again there isn't much code anyway.
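For a quick self-contained check of the rewritten function, something like this should work (the synthetic masked arrays below are made up purely for the test):
import numpy as np
import numpy.ma as ma

rng = np.random.default_rng(0)
mask = rng.random((31, 31)) < 0.5                        # mask out roughly half the pixels
arrb4 = ma.masked_array(rng.random((31, 31)), mask=mask)
arrb5 = ma.masked_array(rng.random((31, 31)), mask=mask)
pfire = ma.masked_array(rng.random((31, 31)), mask=mask)

win4, win5, winp, canfire = ContextualWindows(arrb4, arrb5, pfire)
print(win4.shape, canfire)                               # e.g. (11, 11) 0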
I've run some time tests with ContextualWindows and variants. One i step takes about 50us, all ten about 500.
This simple iteration takes about the same time:
[np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
The iteration mechanism, and the 'copying' arrays are minor parts of the time. Where possible numpy is making views, not copies.
I'd focus on either minimizing the number of these count steps, or speeding up the count.
Comparing times for various operations on these windows:
First time for 1 step:
In [167]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,6)]
10000 loops, best of 3: 43.9 us per loop
now for the 10 steps:
In [139]: timeit [arrb4[15-i:16+i,15-i:16+i].shape for i in range(5,16)]
10000 loops, best of 3: 33.7 us per loop
In [140]: timeit [np.sum(arrb4[15-i:16+i,15-i:16+i]>500) for i in range(5,16)]
1000 loops, best of 3: 390 us per loop
In [141]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
1000 loops, best of 3: 464 us per loop
Simply indexing does not take much time, but testing for conditions takes substantially more.
cumsum is sometimes used to speed up sums over sliding windows. Instead of taking sum (or mean) over each window, you calculate the cumsum and then use the differences between the front and end of window.
Trying something like that, but in 2d - cumsum in both dimensions, followed by differences between diagonally opposite corners:
In [164]: %%timeit
.....: cA4=np.cumsum(np.cumsum(arrb4,0),1)
.....: [cA4[15-i,15-i]-cA4[15+i,15+i] for i in range(5,16)]
.....:
10000 loops, best of 3: 43.1 us per loop
This is almost 10x faster than the (nearly) equivalent sum. The values don't quite match, but the timing suggests that this may be worth refining.
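For reference, here is a sketch of an exact 2-D prefix-sum (integral image) version of that idea, using a padded cumulative array so the window sums come out right; the variable names are mine and a plain, unmasked array is assumed:
import numpy as np

arrb4 = np.arange(31 * 31, dtype=float).reshape(31, 31)   # stand-in for one window

# padded 2-D cumulative sum: c[i, j] == arrb4[:i, :j].sum()
c = np.zeros((32, 32))
c[1:, 1:] = arrb4.cumsum(axis=0).cumsum(axis=1)

for i in range(5, 16):
    lo, hi = 15 - i, 16 + i
    win_sum = c[hi, hi] - c[lo, hi] - c[hi, lo] + c[lo, lo]
    assert np.isclose(win_sum, arrb4[lo:hi, lo:hi].sum())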
I'm looking to compute something like the sum over i = 1..5000 of comb(5000, i) * 0.5**5000 * f(i),
where f(i) is a function that returns a real number in [-1,1] for any i in {1,2,...,5000}.
Obviously, the result of the sum is somewhere in [-1,1], but I can't seem to compute it in Python with straightforward code, as 0.5**5000 becomes 0 and comb(5000,2000) becomes inf, which results in the computed sum turning into NaN.
The natural solution is to work with logs. That is, using the identity a * b = 2**(log2(a) + log2(b)): if I could compute log2(a) and log2(b), I could compute the sum even if a is big and b is almost 0.
So I guess what I'm asking is if there's an easy way of computing
log2(scipy.misc.comb(5000,2000))
So I could compute my sum simply by
sum([2**(log2comb(5000,i)-5000) * f(i) for i in range(1,5001)])
#abarnert's solution, while working for the 5000 figure, addresses the problem by increasing the precision with which comb is computed. This works for this example, but doesn't scale, as the memory required would increase significantly if instead of 5000 we had 1e7, for example.
Currently, I'm using a workaround which is ugly, but keeps memory consumption low:
log2(comb(5000, 2000)) = sum(log2(x) for x in range(1, 5001)) - sum(log2(x) for x in range(1, 2001)) - sum(log2(x) for x in range(1, 3001))
Is there a way of doing so in a readable expression?
The sum, i.e. the sum over i of comb(n, i) * 0.5**n * f(i),
is the expectation of f with respect to a binomial distribution with n = 5000 and p = 0.5.
You can compute this with scipy.stats.binom.expect:
import scipy.stats as stats
def f(i):
    return i
n, p = 5000, 0.5
print(stats.binom.expect(f, (n, p), lb=0, ub=n))
# 2499.99999997
Also note that as n goes to infinity, with p fixed, the binomial distribution approaches the normal distribution with mean np and variance np*(1-p). Therefore, for large n you can instead compute:
import math
print(stats.norm.expect(f, loc=n*p, scale=math.sqrt((n*p*(1-p))), lb=0, ub=n))
# 2500.0
EDIT: #unutbu has answered the real question, but I'll leave this here in case log2comb(n, k) is useful to anyone.
comb(n, k) is n! / ((n-k)! k!), and n! can be computed using the Gamma function gamma(n+1). Scipy provides the function scipy.special.gamma. Scipy also provides gammaln, which is the log (natural log, that is) of the Gamma function.
So log(comb(n, k)) can be computed as gammaln(n+1) - gammaln(n-k+1) - gammaln(k+1)
For example, log(comb(100, 8)) (after executing from scipy.special import gammaln):
In [26]: log(comb(100, 8))
Out[26]: 25.949484949043022
In [27]: gammaln(101) - gammaln(93) - gammaln(9)
Out[27]: 25.949484949042962
and log(comb(5000, 2000)):
In [28]: log(comb(5000, 2000)) # Overflow!
Out[28]: inf
In [29]: gammaln(5001) - gammaln(3001) - gammaln(2001)
Out[29]: 3360.5943053174142
(Of course, to get the base-2 logarithm, just divide by log(2).)
For convenience, you can define:
from math import log
from scipy.special import gammaln
def log2comb(n, k):
    return (gammaln(n+1) - gammaln(n-k+1) - gammaln(k+1)) / log(2)
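A quick check with the numbers from above (log2comb as just defined):
print(log2comb(5000, 2000))   # about 3360.594 / log(2), i.e. roughly 4848.3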
By default, comb gives you a float64, which overflows and gives you inf.
But if you pass exact=True, it gives you a Python variable-sized int instead, which can't overflow (unless you get so ridiculously huge you run out of memory).
And, while you can't use np.log2 on an int, you can use Python's math.log2.
So:
math.log2(scipy.misc.comb(5000, 2000, exact=True))
As an alternative, you realize that n choose k is the falling factorial n*(n-1)*...*(n-k+1) divided by k!, right? You can reduce that to ∏(i=1...k)((n+1-i)/i), which is simple to compute.
Or, if you want to avoid overflow, you can do it by alternating * (n-i) and / (k-i).
Which, of course, you can also reduce to adding and subtracting logs. I think looping in Python and computing 4000 logarithms is going to be slower than looping in C and computing 4000 multiplications, but we can always vectorize it, and then, it might be faster. Let's write it and test:
In [1327]: n, k = 5000, 2000
In [1328]: %timeit math.log2(scipy.misc.comb(5000, 2000, exact=True))
100 loops, best of 3: 1.6 ms per loop
In [1329]: %timeit np.log2(np.arange(n-k+1, n+1)).sum() - np.log2(np.arange(1, k+1)).sum()
10000 loops, best of 3: 91.1 µs per loop
Of course if you're more concerned with memory instead of time… well, this obviously makes it worse. We've got 2000 8-byte floats instead of one 608-byte integer at a time. And if you go up to 100000, 20000, you get 20000 8-byte floats instead of one 9K integer. And at 1000000, 200000, it's 200000 8-byte floats vs. one 720K integer.
I'm not sure why either way is a problem for you. Especially given that you're using a listcomp instead of a genexpr, and therefore creating an unnecessary list of 5000, 100000, or 1000000 Python floats—24MB is not a problem, but 720K is? But if it is, we can obviously just do the same thing iteratively, at the cost of some speed:
r = sum(math.log2(n-i) - math.log2(k-i) for i in range(k))
This isn't too much slower than the scipy solution, and it never uses more than a small constant number of bytes (a handful of Python floats). (Unless you're on Python 2, in which case… just use xrange instead of range and it's back to constant.)
As a side note, why are you using a list comprehension instead of an NumPy array with vectorized operations (for speed, and also a bit of compactness) or a generator expression instead of a list comprehension (for no memory usage at all, at no cost to speed)?
Imagine that you have some counter or other data element that needs to be stored in a field of a binary protocol. The field naturally has some fixed number n of bits and the protocol specifies that you should store the n least significant bits of the counter, so that it wraps around when it is too large. One possible way to implement that is actually taking the modulus by a power of two:
field_value = counter % 2 ** n
But this certainly isn't the most efficient way and maybe not even the easiest to understand, taking into account that the specification is talking about the least significant bits and does not mention a modulus operation. Thus, investigating alternatives is appropriate. Some examples are:
field_value = counter % (1 << n)
field_value = counter & (1 << n) - 1
field_value = counter & ~(-1 << n)
What is the way preferred by experienced Python programmers to implement such a requirement trying to maximize code clarity without sacrificing too much performance?
There is of course no right or wrong answer to this question, so I would like to use this question to collect all the reasonable implementations of this seemingly trivial requirement. An answer should list the alternatives and shortly describe in what circumstance what alternative would preferably be used.
Bit shifting and bitwise operations are more readable in your case, because they simply tell the reader that you are doing bit manipulation here. If you use the numeric operation, the reader may not understand what it means to take the counter modulo that number.
Talking about performance: you actually don't have to worry too much about this in Python, because the cost of operating on a Python object at all is expensive enough that whether you do it with numeric or bitwise operations simply doesn't matter. Here is a visual way to see it:
<-------------- Python object operation cost --------------><- bit op ->
<-------------- Python object operation cost --------------><----- num op ----->
This is just a rough idea of what it costs to perform the simplest bit operation or numeric operation. As you can see, the Python object operation cost takes up the majority, so whether you use bitwise or numeric operations, the difference is too small to matter.
If you really do need performance, for example because you have to process a massive amount of data, you should consider:
Writing the logic in a C/C++ module for Python; you can use a library like Boost.Python.
Using a third-party library for bulk numeric processing, such as numpy.
you should simply throw away the top bits.
#field_value = counter & (1 << n) - 1
field_value = counter & ALLOWED_BIT_WIDTH
If this were implemented on an embedded device, the registers used could be the limiting factor. In my experience this is the way it is normally done.
The "limitation" in the protocol is a way of constraining the overhead bandwidth needed by the protocol.
It will be dependent on the python implementation probably, but in CPython 2.6, it looks like this:
In [1]: counter = 0xfedcba9876543210
In [10]: %timeit counter % 2**15
1000000 loops, best of 3: 304 ns per loop
In [11]: %timeit counter % (1<<15)
1000000 loops, best of 3: 302 ns per loop
In [12]: %timeit counter & ((1<<15)-1)
10000000 loops, best of 3: 104 ns per loop
In [13]: %timeit counter & ~(1<<15)
10000000 loops, best of 3: 170 ns per loop
In this case, counter & ((1<<15)-1) is the clear winner. Interestingly, 2**15 and 1<<15 take (more or less) the same amount of time; I am guessing Python internally optimizes this case and turns 2**15 into 1<<15 anyway.
I once wrote a class that lets you just do this:
bc = BitSliceLong(counter)
bc = bc[15:0]
It's derived from long, but it's a more general implementation (it lets you take any range of the bits, not just x:0), and the extra overhead for that makes it slower by an order of magnitude, even though it's using the same method inside.
Edit: BTW, precalculating the values doesn't appear to provide any benefit - the dominant factor here is not the actual math operation. If we do
cx_mask = 2**15
counter % cx_mask
the time is the same as when it had to calculate 2**15. This was also true for our 'best case' - precalculating ((1<<15)-1) has no benefit.
Also, in the previous case, I used a large number that is implemented as a long in python. This is not really a native type - it supports arbitrary length numbers, and so needs to handle nearly anything, so implementing operations is not just a single ALU call - it involves a series of bit-shifting and arithmetic operations.
If you can keep the counter below sys.maxint, you'll be using int types instead, and they both appear to be faster & also more dominated by actual math code:
In [55]: %timeit x % (1<<15)
10000000 loops, best of 3: 53.6 ns per loop
In [56]: %timeit x & ((1<<15)-1)
10000000 loops, best of 3: 49.2 ns per loop
In [57]: %timeit x % (2**15)
10000000 loops, best of 3: 53.9 ns per loop
These are all about the same, so it doesn't matter which one you use here really. (mod slightly slower, but within random variation). It makes sense for div/mod to be an expensive operation on very large numbers, with a more complex algorithm, while for 'small' ints it can be done in hardware.
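If this comes up in more than one place, wrapping the preferred form in a tiny helper (the name is my own) keeps the intent obvious at the call sites:
def low_bits(value, n):
    """Return the n least significant bits of value (i.e. value modulo 2**n)."""
    return value & ((1 << n) - 1)

assert low_bits(0xfedcba9876543210, 15) == 0xfedcba9876543210 % 2**15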