How to boost efficiency for large number modulus multiplications in Python

How to boost efficiency for large number modulus multiplications in Python - python

I am using python3 without any tailored library for some simple arithmetic. The operation that dominates computational efficiency is a multiplication of many 2048 bit values:
length=len(array)
res=1
for x in range(length):
res=(res*int(array[x]))
ret=res%n2
To give you an insight it takes ~3940 seconds to make 10000 multiplications moduli a number for every multiplication for an:
Intel Core i5 CPU M 560 # 2.67GHz × 4 with 8GB of memory, running Ubuntu 12.04 32bit machine.
Would it make sense to boost it up using a library like gmpy2 or there would not be any advantage?

You seem to be calculating the product of all the numbers first then taking the remainder, rather than exploiting the properties of modular multiplication: a * b * c mod p == (a * b mod p) * c mod p. This takes very little time at all to multiply 10,000 2048-bit numbers modulo some n:
In [1]: import random
In [2]: array = [random.randrange(2**2048) for i in range(10000)]
In [3]: n = random.randrange(2**2048)
In [4]: prod = 1
In [5]: %%time
...: for e in array:
...: prod *= e
...: prod %= n
...:
CPU times: user 210 ms, sys: 4.07 ms, total: 214 ms
Wall time: 206 ms
For you, I would suggest:
array = map(int, array)
prod = 1
for x in array:
prod *= x
prod %= n2

Related

Efficient summation in Python

I am trying to efficiently compute a summation of a summation in Python:
WolframAlpha is able to compute it too a high n value: sum of sum.
I have two approaches: a for loop method and an np.sum method. I thought the np.sum approach would be faster. However, they are the same until a large n, after which the np.sum has overflow errors and gives the wrong result.
I am trying to find the fastest way to compute this sum.
import numpy as np
import time
def summation(start,end,func):
sum=0
for i in range(start,end+1):
sum+=func(i)
return sum
def x(y):
return y
def x2(y):
return y**2
def mysum(y):
return x2(y)*summation(0, y, x)
n=100
# method #1
start=time.time()
summation(0,n,mysum)
print('Slow method:',time.time()-start)
# method #2
start=time.time()
w=np.arange(0,n+1)
(w**2*np.cumsum(w)).sum()
print('Fast method:',time.time()-start)

Here's a very fast way:
result = ((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120
How I got there:
Rewrite the inner sum as the well-known x*(x+1)//2. So the whole thing becomes sum(x**2 * x*(x+1)//2 for x in range(n+1)).
Rewrite to sum(x**4 + x**3 for x in range(n+1)) // 2.
Look up formulas for sum(x**4) and sum(x**3).
Simplify the resulting mess to (12*n**5 + 45*n**4 + 50*n**3 + 15*n**2 - 2*n) // 120.
Horner it.
Another way to derive it if after steps 1. and 2. you know it's a polynomial of degree 5:
Compute six values with a naive implementation.
Compute the polynomial from the six equations with six unknowns (the polynomial coefficients). I did it similarly to this, but my matrix A is left-right mirrored compared to that, and I called my y-vector b.
Code:
from fractions import Fraction
import math
from functools import reduce
def naive(n):
return sum(x**2 * sum(range(x+1)) for x in range(n+1))
def lcm(ints):
return reduce(lambda r, i: r * i // math.gcd(r, i), ints)
def polynomial(xys):
xs, ys = zip(*xys)
n = len(xs)
A = [[Fraction(x**i) for i in range(n)] for x in xs]
b = list(ys)
for _ in range(2):
for i0 in range(n):
for i in range(i0 + 1, n):
f = A[i][i0] / A[i0][i0]
for j in range(i0, n):
A[i][j] -= f * A[i0][j]
b[i] -= f * b[i0]
A = [row[::-1] for row in A[::-1]]
b.reverse()
coeffs = [b[i] / A[i][i] for i in range(n)]
denominator = lcm(c.denominator for c in coeffs)
coeffs = [int(c * denominator) for c in coeffs]
horner = str(coeffs[-1])
for c in coeffs[-2::-1]:
horner += ' * n'
if c:
horner = f"({horner} {'+' if c > 0 else '-'} {abs(c)})"
return f'{horner} // {denominator}'
print(polynomial((x, naive(x)) for x in range(6)))
Output (Try it online!):
((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120

(fastest methods, 3 and 4, are at the end)
In a fast NumPy method you need to specify dtype=np.object so that NumPy does not convert Python int to its own dtypes (np.int64 or others). It will now give you correct results (checked it up to N=100000).
# method #2
start=time.time()
w=np.arange(0, n+1, dtype=np.object)
result2 = (w**2*np.cumsum(w)).sum()
print('Fast method:', time.time()-start)
Your fast solution is significantly faster than the slow one. Yes, for large N's, but already at N=100 it is like 8 times faster:
start=time.time()
for i in range(100):
result1 = summation(0, n, mysum)
print('Slow method:', time.time()-start)
# method #2
start=time.time()
for i in range(100):
w=np.arange(0, n+1, dtype=np.object)
result2 = (w**2*np.cumsum(w)).sum()
print('Fast method:', time.time()-start)
Slow method: 0.06906533241271973
Fast method: 0.008007287979125977
EDIT: Even faster method (by KellyBundy, the Pumpkin) is by using pure python. Turns out NumPy has no advantage here, because it has no vectorized code for np.objects.
# method #3
import itertools
start=time.time()
for i in range(100):
result3 = sum(x*x * ysum for x, ysum in enumerate(itertools.accumulate(range(n+1))))
print('Faster, pure python:', (time.time()-start))
Faster, pure python: 0.0009944438934326172
EDIT2: Forss noticed that numpy fast method can be optimized by using x*x instead of x**2. For N > 200 it is faster than pure Python method. For N < 200 it is slower than pure Python method (the exact value of boundary may depend on machine, on mine it was 200, its best to check it yourself):
# method #4
start=time.time()
for i in range(100):
w = np.arange(0, n+1, dtype=np.object)
result2 = (w*w*np.cumsum(w)).sum()
print('Fast method x*x:', time.time()-start)

Comparing Python with WolframAlpha like that is unfair, since Wolfram will simplify the equation before computing.
Fortunately, the Python ecosystem knows no limits, so you can use SymPy:
from sympy import summation
from sympy import symbols
n, x, y = symbols("n,x,y")
eq = summation(x ** 2 * summation(y, (y, 0, x)), (x, 0, n))
eq.evalf(subs={"n": 1000})
It will compute the expected result almost instantly: 100375416791650. This is because SymPy simplifies the equation for you, just like Wolfram does. See the value of eq:
#Kelly Bundy's answer is awesome, but if you are like me and use a calculator to compute 2 + 2, then you will love SymPy ❤. As you can see, it gets you to the same results with just 3 lines of code and is a solution that would also work for other more complex cases.

In a comment, you mention that it's really f(x) and g(y) instead of x2 and y. If you're only needing an approximation to that sum, you can pretend the sums are midpoint Riemann sums, so that your sum is approximated by the double integral ∫-.5n+.5 f(x) ∫-.5x+.5 g(y) dy dx.
With your original f(x)=x2 and g(y)=y, this simplifies to n5/10+3n4/8+n3/2+5n2/16+3n/32+1/160, which differs from the correct result by n3/12+3n2/16+53n/480+1/160.
Based on this, I suspect that (actual-integral)/actual would be max(f'',g'')*O(n-2), but I wasn't able to prove it.

All the answers uses math to simplify or implement the loop in python trying to be cpu optimal, but they are not memory optimal.
Here a naive implementation without using any math simplification which is memory efficient
def function5():
inner_sum = float()
result = float()
for x in range(0, n + 1):
inner_sum += x
result += x ** 2 * inner_sum
return result
It is quite slow with respect to the other solutions by dankal444:
method 2 | 31 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
method 3 | 116 µs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
method 4 | 91 µs ± 356 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
function 5 | 217 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
by the way if you jit the function with numba (there may be better options):
from numba import jit
function5 = jit(nopython=True)(function5)
you get
59.8 ns ± 0.209 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Comparing numpy array with itself by element efficiently

I am performing a large number of these calculations:
A == A[np.newaxis].T
where A is a dense numpy array which frequently has common values.
For benchmarking purposes we can use:
n = 30000
A = np.random.randint(0, 1000, n)
A == A[np.newaxis].T
When I perform this calculation, I run into memory issues. I believe this is because the output isn't in more efficient bitarray or np.packedbits format. A secondary concern is we are performing twice as many comparisons as necessary, since the resulting Boolean array is symmetric.
The questions I have are:
Is it possible to produce the Boolean numpy array output in a more memory efficient fashion without sacrificing speed? The options I know about are bitarray and np.packedbits, but I only know how to apply these after the large Boolean array is created.
Can we utilise the symmetry of our calculation to halve the number of comparisons processed, again without sacrificing speed?
I will need to be able to perform & and | operations on Boolean arrays output. I have tried bitarray, which is super-fast for these bitwise operations. But it is slow to pack np.ndarray -> bitarray and then unpack bitarray -> np.ndarray.
[Edited to provide clarification.]

Here's one with numba to give us a NumPy boolean array as output -
from numba import njit
#njit
def numba_app1(idx, n, s, out):
for i,j in zip(idx[:-1],idx[1:]):
s0 = s[i:j]
c = 0
for p1 in s0[c:]:
for p2 in s0[c+1:]:
out[p1,p2] = 1
out[p2,p1] = 1
c += 1
return out
def app1(A):
s = A.argsort()
b = A[s]
n = len(A)
idx = np.flatnonzero(np.r_[True,b[1:] != b[:-1],True])
out = np.zeros((n,n),dtype=bool)
numba_app1(idx, n, s, out)
out.ravel()[::out.shape[1]+1] = 1
return out
Timings -
In [287]: np.random.seed(0)
...: n = 30000
...: A = np.random.randint(0, 1000, n)
# Original soln
In [288]: %timeit A == A[np.newaxis].T
1 loop, best of 3: 317 ms per loop
# #Daniel F's soln-1 that skips assigning lower diagonal in output
In [289]: %timeit sparse_outer_eq(A)
1 loop, best of 3: 450 ms per loop
# #Daniel F's soln-2 (complete one)
In [291]: %timeit sparse_outer_eq(A)
1 loop, best of 3: 634 ms per loop
# Solution from this post
In [292]: %timeit app1(A)
10 loops, best of 3: 66.9 ms per loop

This isn't even a numpy answer, but should work to keep your data requirements down by using a bit of homebrewed sparse notation
from numba import jit
#jit # because this is gonna be loopy
def sparse_outer_eq(A):
n = A.size
c = []
for i in range(n):
for j in range(i + 1, n):
if A[i] == A[j]:
c.append((i, j))
return c
Now c is a list of coordinate tuples (i, j), i < j that correspond to coordinates in your boolean array that are "True". You can easily do and and or operations on these setwise:
list(set(c1) & set(c2))
list(set(c1) | set(c2))
Later, when you want to apply this mask to an array, you can back out the coordinates and use them for fancy indexing instead:
i_, j_ = list(np.array(c).T)
i = np.r_[i_, j_, np.arange(n)]
j = np.r_[j_, i_, np.arange(n)]
You can then np.lexsort i nd j if you care about order
Alternatively, you can define sparse_outer_eq as:
#jit
def sparse_outer_eq(A):
n = A.size
c = []
for i in range(n):
for j in range(n):
if A[i] == A[j]:
c.append((i, j))
return c
Which keeps >2x the data, but then the coordinates come out simply:
i, j = list(np.array(c).T)
if you've done any set operations, this will still need to be lexsorted if you want a rational order.
If your coordinates are each n-bit integers, this should be more space-efficient than boolean format as long as your sparsity is less than 1/n -> 3% or so for 32-bit.
as for time, thanks to numba it's even faster than broadcasting:
n = 3000
A = np.random.randint(0, 1000, n)
%timeit sparse_outer_eq(A)
100 loops, best of 3: 4.86 ms per loop
%timeit A == A[:, None]
100 loops, best of 3: 11.8 ms per loop
and comparisons:
a = A == A[:, None]
b = B == B[:, None]
a_ = sparse_outer_eq(A)
b_ = sparse_outer_eq(B)
%timeit a & b
100 loops, best of 3: 5.9 ms per loop
%timeit list(set(a_) & set(b_))
1000 loops, best of 3: 641 µs per loop
%timeit a | b
100 loops, best of 3: 5.52 ms per loop
%timeit list(set(a_) | set(b_))
1000 loops, best of 3: 955 µs per loop
EDIT: if you want to do &~ (as per your comment) use the second sparse_outer_eq method (so you don't have to keep track of the diagonal) and just do:
list(set(a_) - set(b_))

Here is the more or less canonical argsort solution:
import numpy as np
def f_argsort(A):
idx = np.argsort(A)
As = A[idx]
ne_ = np.r_[True, As[:-1] != As[1:], True]
bnds = np.flatnonzero(ne_)
valid = np.diff(bnds) != 1
return [idx[bnds[i]:bnds[i+1]] for i in np.flatnonzero(valid)]
n = 30000
A = np.random.randint(0, 1000, n)
groups = f_argsort(A)
for grp in groups:
print(len(grp), set(A[grp]), end=' ')
print()

I'm adding a solution to my question because it satisfies these 3 properties:
Low, fixed, memory requirement
Fast bitwise operations (&, |, ~, etc)
Low storage, 1-bit per Boolean via packing integers
The downside is it is stored in np.packbits format. It is substantially slower than other methods (especially argsort), but if speed is not an issue the algorithm should work well. If anyone figures a way to optimise further, this would be very helpful.
Update: A more efficient version of the below algorithm can be found here: Improving performance on comparison algorithm np.packbits(A==A[:, None], axis=1).
import numpy as np
from numba import jit
#jit(nopython=True)
def bool2int(x):
y = 0
for i, j in enumerate(x):
if j: y += int(j)<<(7-i)
return y
#jit(nopython=True)
def compare_elementwise(arr, result, section):
n = len(arr)
for row in range(n):
for col in range(n):
section[col%8] = arr[row] == arr[col]
if ((col + 1) % 8 == 0) or (col == (n-1)):
result[row, col // 8] = bool2int(section)
section[:] = 0
return result
A = np.random.randint(0, 10, 100)
n = len(A)
result_arr = np.zeros((n, n // 8 if n % 8 == 0 else n // 8 + 1)).astype(np.uint8)
selection_arr = np.zeros(8).astype(np.uint8)
packed = compare_elementwise(A, result_arr, selection_arr)

cython prange not so fast than single thread

I just wrote a trivial program to test how cython's prange performs, and here is the code:
from cython.parallel import prange
import numpy as np
def func(int r, int c):
cdef:
double[:,:] a = np.arange(r*c, dtype=np.double).reshape(r,c)
double total = 0
int i, j
for i in prange(r, nogil=True, schedule='static', chunksize=1):
for j in range(c):
total += a[i,j]
return total
On Mac Book pro, with OMP_NUM_THREADS=3, the above code takes almost 18 sec for (r,c) = (10000, 100000), and with single thread, it takes about 21 sec.
Why there is so little performance boost? Am I using this prange correctly?

Have you timed how long it takes just to allocate a? A 10000 x 100000 float64 array takes up 8GB of memory.
a = np.ones((10000, 100000), np.double)
takes over six seconds on my laptop with 16GB of RAM. If you don't have 8GB free then you'll hit the swap and it will take a lot longer. Since func spends almost all of its time just allocating a, parallelizing your outer for loop can therefore only gain you a small fractional improvement on the total runtime.
To demonstrate this, I have modified your function to accept a as an input. In tmp.pyx:
#cython: boundscheck=False, wraparound=False, initializedcheck=False
from cython.parallel cimport prange
def serial(double[:, :] a):
cdef:
double total = 0
int i, j
for i in range(a.shape[0]):
for j in range(a.shape[1]):
total += a[i, j]
return total
def parallel(double[:, :] a):
cdef:
double total = 0
int i, j
for i in prange(a.shape[0], nogil=True, schedule='static', chunksize=1):
for j in range(a.shape[1]):
total += a[i, j]
return total
For example:
In [1]: import tmp
In [2]: r, c = 10000, 100000
In [3]: a = np.random.randn(r, c) # this takes ~6.75 sec
In [4]: %timeit tmp.serial(a)
1 loops, best of 3: 1.25 s per loop
In [5]: %timeit tmp.parallel(a)
1 loops, best of 3: 450 ms per loop
Parallelizing the function gave about a 2.8x speed-up* on my laptop with 4 cores, but this is only a small fraction of the time taken to allocate a.
The lesson here is to always profile your code to understand where it spends its most of its time before you dive into optimizations.
* You could do a little better by passing larger chunks of a to each worker process, e.g. by increasing chunksize or using schedule='guided'

Comparing runtime of standard list comprehension vs. NumPy

I'm experimenting with NumPy to see how and where it is faster than using generic list comprehensions in Python. Here's a standard coding question I'm using for this experiment.
Find the sum of all the multiples of 3 or 5 below 1000000.
I have written three functions to compute this number.
def fA(M):
sum = 0
for x in range(M):
if x % 3 == 0 or x % 5 == 0:
sum += x
return sum
def fB(M):
multiples_3 = range(0, M, 3)
multiples_5 = range(0, M, 5)
multiples_15 = range(0, M, 15)
return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)
def fC(M):
arr = np.arange(M)
return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])
I first did a quick sanity check to see that the three functions produced the same answer.
I then used timeit to compare the runtimes for the three functions.
%timeit -n 100 fA(1000000)
100 loops, best of 3: 182 ms per loop
%timeit -n 100 fB(1000000)
100 loops, best of 3: 14.4 ms per loop
%timeit -n 100 fC(1000000)
100 loops, best of 3: 44 ms per loop
It's no surprise that fA is the slowest. But why is fB so much better than fC? Is there a better way to compute this answer using NumPy?
I don't think size is an issue here. In fact, if I change the 1e6 to 1e9, fC becomes even slower when compared to fB.

fB is so much faster than fC because fC is not the NumPy equivalent of fB. fC is the NumPy equivalent of fA. This is the NumPy equivalent of fB:
def fD(M):
multiples_3 = np.arange(0, M, 3)
multiples_5 = np.arange(0, M, 5)
multiples_15 = np.arange(0, M, 15)
return multiples_3.sum() + multiples_5.sum() - multiples_15.sum()
It runs way faster:
In [4]: timeit fB(1000000)
100 loops, best of 3: 9.96 ms per loop
In [5]: timeit fD(1000000)
1000 loops, best of 3: 637 µs per loop

In fB you are constructing the ranges with the exact multiples you want. Their sizes become smaller from 3 to 5 to 15 and thus each takes less time to construct than the one before, after they are constructed you only need to take the sum and do some arithmetic.
In fC you are constructing a 100000 element array, the size isn't really the issue as much as the two modulo comparisons you are doing which must look at every single element in the array. This takes the lion's share of the execution time (about 90 %) for fC.

You're only really using numpy there to generate an array. You'd see a much bigger difference if you were trying to perform operations on arrays as opposed to performing them on lists or tuples. With regards to that particular problem, take a look at the function fD in the code below, which just calculates how many multiples there should be in each range and then calculates their sum, rather than generating the array. Actually, if you run the below snippet, you'll see how the times change in function of M. Also, fC breaks down for M >= 100000. I couldn't tell you why.
import numpy as np
from time import time
def fA(M):
sum = 0
for x in range(M):
if x % 3 == 0 or x % 5 == 0:
sum += x
return sum
def fB(M):
multiples_3 = range(0, M, 3)
multiples_5 = range(0, M, 5)
multiples_15 = range(0, M, 15)
return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)
def fC(M):
arr = np.arange(M)
return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])
def fD(M):
return sum_mult(M,3)+sum_mult(M,5)-sum_mult(M,15)
def sum_mult(M,n):
instances=(M-1)/n
check=len(range(n,M,n))
return (n*instances*(instances+1))/2
for x in range(5,20):
print "*"*20
M=2**x
print M
answers=[]
T=[]
for f in (fA,fB,fC,fD):
ts=time()
answers.append(f(M))
for i in range(20):
f(M)
T.append(time()-ts)
if not all([x==answers[0] for x in answers]):
print "Warning! Answers do not match!",answers
print T

What is the best way to compute the trace of a matrix product in numpy?

If I have numpy arrays A and B, then I can compute the trace of their matrix product with:
tr = numpy.linalg.trace(A.dot(B))
However, the matrix multiplication A.dot(B) unnecessarily computes all of the off-diagonal entries in the matrix product, when only the diagonal elements are used in the trace. Instead, I could do something like:
tr = 0.0
for i in range(n):
tr += A[i, :].dot(B[:, i])
but this performs the loop in Python code and isn't as obvious as numpy.linalg.trace.
Is there a better way to compute the trace of a matrix product of numpy arrays? What is the fastest or most idiomatic way to do this?

You can improve on #Bill's solution by reducing intermediate storage to the diagonal elements only:
from numpy.core.umath_tests import inner1d
m, n = 1000, 500
a = np.random.rand(m, n)
b = np.random.rand(n, m)
# They all should give the same result
print np.trace(a.dot(b))
print np.sum(a*b.T)
print np.sum(inner1d(a, b.T))
%timeit np.trace(a.dot(b))
10 loops, best of 3: 34.7 ms per loop
%timeit np.sum(a*b.T)
100 loops, best of 3: 4.85 ms per loop
%timeit np.sum(inner1d(a, b.T))
1000 loops, best of 3: 1.83 ms per loop
Another option is to use np.einsum and have no explicit intermediate storage at all:
# Will print the same as the others:
print np.einsum('ij,ji->', a, b)
On my system it runs slightly slower than using inner1d, but it may not hold for all systems, see this question:
%timeit np.einsum('ij,ji->', a, b)
100 loops, best of 3: 1.91 ms per loop

From wikipedia you can calculate the trace using the hadamard product (element-wise multiplication):
# Tr(A.B)
tr = (A*B.T).sum()
I think this takes less computation than doing numpy.trace(A.dot(B)).
Edit:
Ran some timers. This way is much faster than using numpy.trace.
In [37]: timeit("np.trace(A.dot(B))", setup="""import numpy as np;
A, B = np.random.rand(1000,1000), np.random.rand(1000,1000)""", number=100)
Out[38]: 8.6434469223022461
In [39]: timeit("(A*B.T).sum()", setup="""import numpy as np;
A, B = np.random.rand(1000,1000), np.random.rand(1000,1000)""", number=100)
Out[40]: 0.5516049861907959

Note that one slight variant is to take the dot product of the vectorized matrices. In python, vectorization is done using .flatten('F'). It's slightly slower than taking the sum of the Hadamard product, on my computer, so it's a worse solution than wflynny's , but I think it's kind of interesting, since it can be more intuitive, in some situations, in my opinion. For example, personally I find that for the matrix normal distribution, the vectorized solution is easier for me to understand.
Speed comparison, on my system:
import numpy as np
import time
N = 1000
np.random.seed(123)
A = np.random.randn(N, N)
B = np.random.randn(N, N)
tart = time.time()
for i in range(10):
C = np.trace(A.dot(B))
print(time.time() - start, C)
start = time.time()
for i in range(10):
C = A.flatten('F').dot(B.T.flatten('F'))
print(time.time() - start, C)
start = time.time()
for i in range(10):
C = (A.T * B).sum()
print(time.time() - start, C)
start = time.time()
for i in range(10):
C = (A * B.T).sum()
print(time.time() - start, C)
Result:
6.246593236923218 -629.370798672
0.06539678573608398 -629.370798672
0.057890892028808594 -629.370798672
0.05709719657897949 -629.370798672

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to boost efficiency for large number modulus multiplications in Python - python

Related

Efficient summation in Python

Comparing numpy array with itself by element efficiently

cython prange not so fast than single thread

Comparing runtime of standard list comprehension vs. NumPy

What is the best way to compute the trace of a matrix product in numpy?

Categories

Resources