I need to compute:
x=(x*a+b)/2 % 2**128
many times. x, a, b are 128-bit numbers (chosen randomly). What is the fastest way to do it? I thought about numpy; could it help somehow? Right now it is about 100 times too slow... Is there a way to do it faster? Of course it has to be done separately, step by step; the algorithm is more complicated than this (a and b change after a few steps), so we can't apply any math tricks or fast exponentiation here.
Example of more complete code:
a=333
b=555
c=777
d=999
x=12345
for i in range(128):
    if x % 2 == 1:
        x = ((x * a + b) / 2) % 340282366920938463463374607431768211456
    else:
        x = (x * c / 2 + d) % 340282366920938463463374607431768211456
print(x)
You're going to have some trouble in that you're doing inherently inefficient operations: 128-bit integers are not native to most Python implementations and incur the penalties of long-integer arithmetic. However, you can drop the execution time by about 20% if you use shift and mask operations instead of division and modulus by powers of 2:
import timeit

a = 333
b = 555
c = 777
d = 999
x = 12345
two_128 = 2 ** 128
mask = two_128 - 1

def rng_orig(x):
    for i in range(128):
        if x % 2 == 1:
            x = ((x * a + b) // 2) % two_128
        else:
            x = (x * c // 2 + d) % two_128

def rng_bit(x):
    for i in range(128):
        if x & 1:
            x = ((x * a + b) >> 1) & mask
        else:
            # (x * c) >> 1 matches the x * c // 2 above (x*c/2 parses as (x*c)/2)
            x = (((x * c) >> 1) + d) & mask

repeat = 100000
print(timeit.timeit(lambda: rng_orig(x), number = repeat))
print(timeit.timeit(lambda: rng_bit(x), number = repeat))
Timing results:
5.1968478000000005
3.965898900000001
If this is intended to be integer arithmetic, you should use integer division (//). This avoids an unnecessary conversion to floats. Also, using a bitwise operation for the modulo is probably going to be faster:
mask128 = 2**128 - 1
x = ( (x*a+b)//2 ) & mask128
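Putting both fixes together, the question's full loop would look like this (a minimal sketch using the constants from the question; for nonnegative x, & mask128 is equivalent to % 2**128):

a, b, c, d = 333, 555, 777, 999
x = 12345
mask128 = 2**128 - 1

for i in range(128):
    if x & 1:
        x = ((x * a + b) >> 1) & mask128
    else:
        x = (((x * c) >> 1) + d) & mask128
print(x)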
Using Python, I would like to implement a function that takes a natural number n as input and outputs a list of natural numbers [y1, y2, y3, ...] such that n + y1*y1 and n + y2*y2 and n + y3*y3 and so forth are again squares.
What I tried so far is to obtain one y-value using the following function:
def find_square(n: int) -> tuple[int, int]:
    if n % 2 == 1:
        y = (n - 1) // 2
        x = n + y * y
        return (y, x)
    return None
It works fine, e.g. find_square(13689) gives me a correct solution y = 6844. It would be great to have an algorithm that yields all possible y-values, such as y = 44 or y = 156.
The simplest, slow approach is of course, for the given N, just to iterate over all possible Y and check whether N + Y^2 is a square.
But there is a much faster approach using an integer factorization technique:
Notice that to solve the equation N + Y^2 = X^2, i.e. to find all integer pairs (X, Y) for a given fixed integer N, we can rewrite it as N = X^2 - Y^2 = (X + Y) * (X - Y), which follows from the well-known difference-of-squares formula.
Now let's rename the two factors as A and B, i.e. N = (X + Y) * (X - Y) = A * B, which means that X = (A + B) / 2 and Y = (A - B) / 2.
Notice that A and B must have the same parity, either both odd or both even; otherwise the divisions by 2 in the formulas above would not be exact.
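For example, for N = 105 the factor pair (A, B) = (15, 7) has both factors odd, giving X = (15 + 7) / 2 = 11 and Y = (15 - 7) / 2 = 4, and indeed 105 + 4^2 = 121 = 11^2.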
We will factorize N into all possible pairs of factors (A, B) of the same parity. For fast factorization, the code below uses the simple-to-implement yet quite fast Pollard's Rho algorithm. Two extra algorithms are needed as helpers for Pollard's Rho: the Fermat primality test (which quickly checks whether a number is probably prime) and trial division factorization (which factors out small factors that could otherwise cause Pollard's Rho to fail).
Pollard's Rho has time complexity O(N^(1/4)) for a composite number, which is very fast even for 64-bit numbers. Any faster factorization algorithm can be substituted if a bigger space needs to be searched. The fast algorithm's running time is dominated by the factorization; the remaining part is blazingly fast, just a few loop iterations with simple formulas.
If your N is itself a square (so that its root is easy to obtain), then Pollard's Rho can factor N much faster still, within O(N^(1/8)) time. Even for 128-bit numbers that means a very short time, about 2^16 operations, and I hope you're solving your task for numbers of fewer than 128 bits.
If you want to process a whole range of possible N values, then the fastest way to factorize them is a technique similar to the Sieve of Eratosthenes: using a set of prime numbers, it computes the factors of every N within some range at once. For a range of Ns this is much faster than factorizing each N separately with Pollard's Rho; a sketch of the idea follows.
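As an illustration of that sieve idea (a hedged sketch, not part of the solution below): precompute the smallest prime factor of every n up to some limit, after which any n in the range can be factored by repeated division:

def spf_sieve(limit):
    # spf[n] will end up holding the smallest prime factor of n
    spf = list(range(limit + 1))
    for p in range(2, int(limit ** 0.5) + 1):
        if spf[p] == p:                      # p is prime
            for m in range(p * p, limit + 1, p):
                if spf[m] == m:
                    spf[m] = p
    return spf

def factor_with_sieve(n, spf):
    # factor n by repeatedly dividing out its smallest prime factor
    fs = []
    while n > 1:
        fs.append(spf[n])
        n //= spf[n]
    return fs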
After factoring N into pairs (A, B), we compute (X, Y) from (A, B) by the formulas above, and output each resulting Y as a solution of the fast algorithm.
The following example code is implemented in pure Python. Of course, one could use Numba to speed it up; Numba usually gives a 30-200x speedup, reaching roughly the speed of optimized C++. But the main point here is to implement the fast algorithm; Numba optimization can easily be added afterwards.
I added time measurement to the code. Although it is pure Python, the fast algorithm still achieves about an 8500x speedup over the regular brute-force approach for a limit of 1,000,000.
You can change the limit variable to tweak the amount of searched space, or the num_tests variable to tweak the number of tests.
The code implements both solutions: the fast solution find_fast() described above, plus a tiny brute-force solution find_slow(), which is very slow because it scans all possible candidates. The slow solution is only used in the tests to check correctness and measure the speedup.
The code uses nothing except a few standard Python library modules; no external modules are needed.
def find_slow(N):
    import math
    def is_square(x):
        root = int(math.sqrt(float(x)) + 0.5)
        return root * root == x, root
    l = []
    for y in range(N):
        if is_square(N + y ** 2)[0]:
            l.append(y)
    return l
def find_fast(N):
    import itertools, functools
    Prod = lambda it: functools.reduce(lambda a, b: a * b, it, 1)
    fs = factor(N)
    mfs = {}
    for e in fs:
        mfs[e] = mfs.get(e, 0) + 1
    fs = sorted(mfs.items())
    del mfs
    Ys = set()
    for take_a in itertools.product(*[
            (range(v + 1) if k != 2 else range(1, v)) for k, v in fs]):
        A = Prod([p ** t for (p, _), t in zip(fs, take_a)])
        B = N // A
        assert A * B == N, (N, A, B, take_a)
        if A < B:
            continue
        X = (A + B) // 2
        Y = (A - B) // 2
        assert N + Y ** 2 == X ** 2, (N, A, B, X, Y)
        Ys.add(Y)
    return sorted(Ys)
def trial_div_factor(n, limit = None):
    # https://en.wikipedia.org/wiki/Trial_division
    fs = []
    while n & 1 == 0:
        fs.append(2)
        n >>= 1
    all_checked = False
    for d in range(3, (limit or n) + 1, 2):
        if d * d > n:
            all_checked = True
            break
        while True:
            q, r = divmod(n, d)
            if r != 0:
                break
            fs.append(d)
            n = q
    if n > 1 and all_checked:
        fs.append(n)
        n = 1
    return fs, n
def fermat_prp(n, trials = 32):
    # https://en.wikipedia.org/wiki/Fermat_primality_test
    import random
    if n <= 16:
        return n in (2, 3, 5, 7, 11, 13)
    for i in range(trials):
        if pow(random.randint(2, n - 2), n - 1, n) != 1:
            return False
    return True
def pollard_rho_factor(n):
    # https://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm
    import math, random
    fs, n = trial_div_factor(n, 1 << 7)
    if n <= 1:
        return fs
    if fermat_prp(n):
        return sorted(fs + [n])
    for itry in range(8):
        failed = False
        x = random.randint(2, n - 2)
        for cycle in range(1, 1 << 60):
            y = x
            for i in range(1 << cycle):
                x = (x * x + 1) % n
                d = math.gcd(x - y, n)
                if d == 1:
                    continue
                if d == n:
                    failed = True
                    break
                return sorted(fs + pollard_rho_factor(d) + pollard_rho_factor(n // d))
            if failed:
                break
    assert False, f'Pollard Rho failed! n = {n}'
def factor(N):
    import functools
    Prod = lambda it: functools.reduce(lambda a, b: a * b, it, 1)
    fs = pollard_rho_factor(N)
    assert N == Prod(fs), (N, fs)
    return sorted(fs)
def test():
    import random, time
    limit = 1 << 20
    num_tests = 20
    t0, t1 = 0, 0
    for i in range(num_tests):
        if (round(i / num_tests * 1000)) % 100 == 0 or i + 1 >= num_tests:
            print(f'test {i}, ', end = '', flush = True)
        N = random.randrange(limit)
        tb = time.time()
        r0 = find_slow(N)
        t0 += time.time() - tb
        tb = time.time()
        r1 = find_fast(N)
        t1 += time.time() - tb
        assert r0 == r1, (N, r0, r1, t0, t1)
    print(f'\nTime slow {t0:.05f} sec, fast {t1:.05f} sec, speedup {round(t0 / max(1e-6, t1))} times')

if __name__ == '__main__':
    test()
Output:
test 0, test 2, test 4, test 6, test 8, test 10, test 12, test 14, test 16, test 18, test 19,
Time slow 26.28198 sec, fast 0.00301 sec, speedup 8732 times
For the easiest solution, you can try this:
import math
n = 13689   # or we can ask the user to input a number
for i in range(1, 9999):
    if math.sqrt(n + i**2).is_integer():
        print(i)
Is there any way to make this code run faster?
g = 53710316114328094
a = 995443176435632644
n = 926093738455418579
print(g**a%n)
It runs for too long; I want to make it faster.
I also tried:
import math
g = 53710316114328094
a = 995443176435632644
n = 926093738455418579
print(math.pow(g**a)%n)
and
g = 53710316114328094
a = 995443176435632644
n = 926093738455418579
def power(a, b):
    ans = 1
    for i in range(b):
        ans *= a
    return ans

print(power(g, a) % n)
All of this code runs for far too long.
First of all, you need to know about the binary exponentiation algorithm. The idea is that instead of computing e.g. 5^46 as 5*5*5*5*... (46 multiplications), you can do
5^46 == 5^2 * 5^4 * 5^8 * 5^32
The key here is that you can compute 5^2 quickly from 5 (just square it), then 5^4 quickly from 5^2 (just square it), then 5^8 from 5^4 (just square it), and so on. To determine which 5^K factors to multiply in and which to skip, you can represent the exponent as a binary number and multiply into the final result only those factors that correspond to a 1 in this binary representation. E.g.
decimal 46 == binary 101110
Thus
5^1 is skipped (the rightmost bit is 0), 5^2 is multiplied in (the rightmost 1), 5^4 is multiplied in (the second 1 from the right), 5^8 is multiplied in (the third 1 from the right), 5^16 is skipped (the 0 second from the left), and 5^32 is multiplied in (the leftmost 1).
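A quick sanity check of that decomposition in Python:

assert bin(46) == '0b101110'
assert 5**46 == 5**2 * 5**4 * 5**8 * 5**32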
Next, you would still need to compute a huge power, impractically big. But there is a shortcut, since you use the modulo operation.
You see, there's a rule that
(a*b % n) == ( (a % n)*(b % n) ) % n
So these should be equivalent
5^46 % n == ( ( ( (5^2 % n) * (5^4 % n) % n) * (5^8 % n) % n) * (5^32 % n) % n)
Notice that each number we multiply never exceeds n, so the overall multiplication chain will not take forever: n is big, but not even remotely as gigantic as g**a.
In code, all of that looks as follows. It computes instantly:
def pow_modulo_n(base, power, n):
    result = 1
    multiplier = base
    while power > 0:
        power, binary_digit = divmod(power, 2)
        if binary_digit == 1:
            result = (result * multiplier) % n
        multiplier = (multiplier**2) % n
    return result % n

g = 53710316114328094
a = 995443176435632644
n = 926093738455418579
print(pow_modulo_n(g, a, n))
This prints
434839845697636246
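Note that this is exactly what Python's built-in three-argument pow does, so in practice you can simply write:

print(pow(g, a, n))   # built-in modular exponentiation, same result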
I am trying to implement a universal hashing function using only the standard library:
I am having issues because I am unable to run this in an acceptable time. I know % is slow, so I have tried the following:
((a * x + b) % P) % n
divmod(divmod(a * x + b, P)[1], n)[1]
subeq = pow(a * x + b, 1, P)
hash = pow(subeq, 1, self.n)
All of these are too slow for what I am trying to do. Is there a faster way to do the modulo operation using only the standard library that I am unaware of?
Edit: To elaborate, I will be running this function about 200,000 times (or more), and I need all 200,000 runs to complete in under 4 seconds. None of these methods are even in that ballpark (they take minutes).
You're not going to do better than ((a * x + b) % P) % m in pure Python code; the overhead of the Python interpreter is going to bottleneck you more than anything else. Yes, if you ensure m is a power of two, you can precompute mm1 = m - 1 and change the computation to ((a * x + b) % P) & mm1, replacing a more expensive remaindering operation with a cheaper bitmasking operation; but unless P is huge (hundreds of bits minimum), the interpreter overhead will likely outweigh the difference between remainder and bitmasking.
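A minimal sketch of that power-of-two trick (the sample values for a, b, x and P are made up for illustration):

P = (1 << 61) - 1                 # a Mersenne prime, a common choice for P
a, b, x = 2736499, 916803, 123456789
m = 1 << 16                       # must be a power of two for the trick to apply
mm1 = m - 1                       # precomputed mask
assert ((a * x + b) % P) & mm1 == ((a * x + b) % P) % m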
If you really need the performance, and the types you're working with fit in a C-level primitive type, you may benefit from writing a Python C extension that converts all the values to size_t, Py_hash_t, uint64_t, or whatever suits your problem, performs the math at the C level, and then does a single conversion back to a Python int, saving a bunch of bytecode and intermediate values (that are promptly tossed).
If the values are too large to fit in C primitives, GMP types are an option (look at mpz_import and mpz_export for efficient conversions from PyLong to mpz_t and back), but the odds of seeing big savings go down; GMP does math faster in general, and can mutate numbers in place rather than creating and destroying lots of temporaries, but even with mpz_import and mpz_export, the cost of converting between Python and GMP types would likely eat most of the savings.
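If a third-party wrapper is acceptable (an assumption on my part; the answer above is about GMP's C API), the gmpy2 module exposes GMP's mpz type directly from Python, so the idea can be tried without writing a C extension. A hedged sketch with made-up values:

from gmpy2 import mpz

P = (1 << 61) - 1
a, b, m = mpz(2736499), mpz(916803), 1024

def h(x):
    # all intermediates stay as GMP mpz values
    return int(((a * x + b) % P) % m)

print(h(123456789))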
from math import ceil, log2
from primesieve import nth_prime   # nth_prime(n, start) returns the nth prime above start
from random import randint

class UniversalHashing:
    """ N = number of bins
        p = prime number with p >= N; by default the first prime above
            2**32 = 4294967296, i.e. nth_prime(1, 1 << max(32, ceil(log2(N)))) = 4294967311.
        Notes: a << b shifts left, i.e. multiplies by 2**b (2 << 2 == 8, 7 << 3 == 56);
        ceil rounds up to the nearest integer (ceil(1) == 1, ceil(1.1) == 2);
        randint(a, b) returns a random integer N such that a <= N <= b,
        an alias for randrange(a, b+1); the assert raises an error if its
        condition is not satisfied. """

    def __init__(self, N, p = None):
        self.N = N
        if p is None:
            p = nth_prime(1, 1 << max(32, ceil(log2(N))))
        assert p >= N, 'Prime number p should be at least N!'
        self.p = p

    def draw(self):
        # draw a random hash function from the universal family
        a = randint(1, self.p - 1)
        b = randint(0, self.p - 1)
        return lambda x: ((a * x + b) % self.p) % self.N

if __name__ == '__main__':
    N = 50        # bins
    n = 100000    # elements
    H = UniversalHashing(N)
    h = H.draw()
    T = [0] * N
    for _ in range(n):
        x = randint(0, n * 10)
        T[h(x)] += 1
    for i in range(len(T)):
        print(T[i] / n)   # bin loads should be approximately equal
Examples:
1. Input = 4, Output = 111
Explanation:
divisors of 1: 1³ = 1
divisors of 2: 1³ + 2³ = 9
divisors of 3: 1³ + 3³ = 28
divisors of 4: 1³ + 2³ + 4³ = 73
------------------------
sum = 1 + 9 + 28 + 73 = 111 (output)
2. Input = 5, Output = 237
Explanation:
divisors of 1: 1³ = 1
divisors of 2: 1³ + 2³ = 9
divisors of 3: 1³ + 3³ = 28
divisors of 4: 1³ + 2³ + 4³ = 73
divisors of 5: 1³ + 5³ = 126
-----------------------------
sum = 1 + 9 + 28 + 73 + 126 = 237 (output)
# Python 2
x = int(raw_input().strip())
tot = 0
for i in range(1, x+1):
    for j in range(1, i+1):
        if i % j == 0:
            tot += j**3
print tot
Using this code I can find the answer for small numbers, less than one million. But I want to find the answer for very large numbers. Is there an algorithm for solving this efficiently for large numbers?
Offhand I don't see a slick way to make this truly efficient, but it's easy to make it a whole lot faster. If you view your examples as matrices, you're summing them a row at a time. This requires, for each i, finding all the divisors of i and summing their cubes. In all, this requires a number of operations proportional to x**2.
You can easily cut that to a number of operations proportional to x, by summing the matrix by columns instead. Given an integer j, how many integers in 1..x are divisible by j? That's easy: there are x//j multiples of j in the range, so divisor j contributes j**3 * (x // j) to the grand total.
def better(x):
    return sum(j**3 * (x // j) for j in range(1, x+1))
That runs much faster, but still takes time proportional to x.
There are lower-level tricks you can play to speed that in turn by constant factors, but they still take O(x) time overall. For example, note that x // j == 1 for all j such that x // 2 < j <= x. So about half the terms in the sum can be skipped, replaced by closed-form expressions for a sum of consecutive cubes:
def sum3(x):
    """Return sum(i**3 for i in range(1, x+1))."""
    return (x * (x+1) // 2)**2

def better2(x):
    result = sum(j**3 * (x // j) for j in range(1, x//2 + 1))
    result += sum3(x) - sum3(x//2)
    return result
better2() is about twice as fast as better(), but to get faster than O(x) would require deeper insight.
Quicker
Thinking about this in spare moments, I still don't have a truly clever idea. But the last idea I gave can be carried to a logical conclusion: don't just group together divisors with only one multiple in range, but also those with two multiples in range, and three, and four, and ... That leads to better3() below, which does a number of operations roughly proportional to the square root of x:
def better3(x):
    result = 0
    for i in range(1, x+1):
        q1 = x // i
        # value i has q1 multiples in range
        result += i**3 * q1
        # which values have i multiples?
        q2 = x // (i+1) + 1
        assert x // q1 == i == x // q2
        if i < q2:
            result += i * (sum3(q1) - sum3(q2 - 1))
        if i+1 >= q2:   # this becomes true when i reaches roughly sqrt(x)
            break
    return result
Of course O(sqrt(x)) is an enormous improvement over the original O(x**2), but for very large arguments it's still impractical. For example better3(10**6) appears to complete instantly, but better3(10**12) takes a few seconds, and better3(10**16) is time for a coffee break ;-)
Note: I'm using Python 3. If you're using Python 2, use xrange() instead of range().
One more
better4() has the same O(sqrt(x)) time behavior as better3(), but does the summations in a different order that allows for simpler code and fewer calls to sum3(). For "large" arguments, it's about 50% faster than better3() on my box.
def better4(x):
    result = 0
    for i in range(1, x+1):
        d = x // i
        if d >= i:
            # d is the largest divisor that appears `i` times, and
            # all divisors less than `d` also appear at least that
            # often. Account for one occurrence of each.
            result += sum3(d)
        else:
            i -= 1
            lastd = x // i
            # We already accounted for i occurrences of all divisors
            # < lastd, and all occurrences of divisors >= lastd.
            # Account for the rest.
            result += sum(j**3 * (x // j - i)
                          for j in range(1, lastd))
            break
    return result
It may be possible to do better by extending the algorithm in "A Successive Approximation Algorithm for Computing the Divisor Summatory Function". That takes O(cube_root(x)) time for the possibly simpler problem of summing the number of divisors. But it's much more involved, and I don't care enough about this problem to pursue it myself ;-)
Subtlety
There's a subtlety in the math that's easy to miss, so I'll spell it out, but only as it pertains to better4().
After d = x // i, the comment claims that d is the largest divisor that appears i times. But is that true? The actual number of times d appears is x // d, which we did not compute. How do we know that x // d in fact equals i?
That's the purpose of the if d >= i: guarding that comment. After d = x // i we know that
x == d*i + r
for some integer r satisfying 0 <= r < i. That's essentially what floor division means. But since d >= i is also known (that's what the if test ensures), it must also be the case that 0 <= r < d. And that's how we know x // d is i.
This can break down when d >= i is not true, which is why a different method needs to be used then. For example, if x == 500 and i == 51, d (x // i) is 9, but it's certainly not the case that 9 is the largest divisor that appears 51 times. In fact, 9 appears 500 // 9 == 55 times. While for positive real numbers
d == x/i
if and only if
i == x/d
that's not always so for floor division. But, as above, the first does imply the second if we also know that d >= i.
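A quick numeric check of that example:

x, i = 500, 51
d = x // i          # d == 9, but d >= i is false here (9 < 51) ...
print(x // d)       # ... and indeed x // d == 55, not 51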
Just for Fun
better5() rewrites better4() for about another 10% speed gain. The real pedagogical point is to show that it's easy to compute all the loop limits in advance. Part of the point of the odd code structure above is that it magically returns 0 for a 0 input without needing to test for that. better5() gives up on that:
def isqrt(n):
    "Return floor(sqrt(n)) for int n > 0."
    g = 1 << ((n.bit_length() + 1) >> 1)
    d = n // g
    while d < g:
        g = (d + g) >> 1
        d = n // g
    return g

def better5(x):
    assert x > 0
    u = isqrt(x)
    v = x // u
    return (sum(map(sum3, (x // d for d in range(1, u+1)))) +
            sum(x // i * i**3 for i in range(1, v)) -
            u * sum3(v-1))
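As a quick sanity check (my addition, not part of the answer above), all the variants agree on small inputs:

for x in range(1, 300):
    assert better(x) == better2(x) == better3(x) == better4(x) == better5(x)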
def sum_divisors(n):
    total = 0
    for i in range(1, n):
        if n % i == 0:
            total = total + i
    # Return the sum of all divisors of n, not including n
    return total

print(sum_divisors(0))
# 0
print(sum_divisors(3))    # Should be sum of 1
# 1
print(sum_divisors(36))   # Should be sum of 1+2+3+4+6+9+12+18
# 55
print(sum_divisors(102))  # Should be sum of 1+2+3+6+17+34+51
# 114
Why does dividing by the larger factor pair result in slower execution?
My solution for https://codility.com/programmers/task/min_perimeter_rectangle/
from math import sqrt, floor

# This fails the performance tests
def solution_slow(n):
    x = int(sqrt(n))
    for i in xrange(x, n+1):
        if n % i == 0:
            return 2*(i + n / i)

# This passes the performance tests
def solution_fast(n):
    x = int(sqrt(n))
    for i in xrange(x, 0, -1):
        if n % i == 0:
            return 2*(i + n / i)
It's not division that slows it down; it's the number of iterations required.
Let L = xrange(1, x) and R = xrange(x, n+1) (order doesn't matter here). Every factor of n in L can be paired with exactly one factor of n in R. In general, x is much, much smaller than n/2, so L is much smaller than R. This means that there are far more elements of R that don't divide n than there are in L. In the case of a prime number there are no factors other than 1 and n, so the slow solution has to check every value of the much larger set instead of the much smaller one.
That's obvious: the first function loops many more times.
Note that sqrt(n) != n - sqrt(n); in general sqrt(n) << n - sqrt(n), where << here means "much less than".
If n = 1000, the first function's range contains 970 values, while the second one's contains only 31.
I'd say the number of iterations is the key to the performance difference between your functions, as @Bakuriu already said. Also, xrange can be slightly more expensive than a simple loop; for instance, note how f3 performs a little better than f1 and f2:
import timeit
from math import sqrt, floor

def f1(n):
    x = int(sqrt(n))
    for i in xrange(x, n + 1):
        if n % i == 0:
            return 2 * (i + n / i)

def f2(n):
    x = int(sqrt(n))
    for i in xrange(x, 0, -1):
        if n % i == 0:
            return 2 * (i + n / i)

def f3(n):
    x = int(sqrt(n))
    while True:
        if n % x == 0:
            return 2 * (x + n / x)
        x -= 1

N = 30
K = 100000
print("Measuring {0} times f1({1})={2}".format(
    K, N, timeit.timeit('f1(N)', setup='from __main__ import f1, N', number=K)))
print("Measuring {0} times f2({1})={2}".format(
    K, N, timeit.timeit('f2(N)', setup='from __main__ import f2, N', number=K)))
print("Measuring {0} times f3({1})={2}".format(
    K, N, timeit.timeit('f3(N)', setup='from __main__ import f3, N', number=K)))

# Measuring 100000 times f1(30)=0.0738177938151
# Measuring 100000 times f2(30)=0.0753000788315
# Measuring 100000 times f3(30)=0.0503645315841
# [Finished in 0.3s]
Next time you get this type of question, using a profiler is highly recommended :)