optimization tool - python

I wonder if there are any tools to optimize my program in terms of loop unrolling, and how can I use them?
I have the following Python code:
for i in range(0, 1000):
    a = a * 10 + a%4 + i
for j in range(0, 1000):
    j = j + a
for b in range(0, 1000):
    result = j + b
I want to optimize this code segment so that I can try to understand loop unrolling a bit. Besides Python, I would also like to know about a C optimizer.

a = 30
for i in range(0, 1000):
    a = a * 10 + a%4 + i
can be rewritten as:
a = reduce(lambda a, b: a * 10 + a%4 + b, xrange(1000), 30)
which takes about the same time (~4 ms on my computer).
for j in range(0, 1000):
    j = j + a
doesn't make much sense. You are iterating j over 0-999, and each time you add your huge a to it; the result is immediately forgotten, because the next j is taken from the range. It can be rewritten as:
j = 999 + a
for b in range(0, 1000):
    result = j + b
doesn't make much sense either. It is equivalent to:
result = j + 999

If you aren't satisfied with the performance of your code, have profiled it, and found that low-level loops like this are a bottleneck, you should be able to speed up your code hugely by using Cython to turn the expensive bits into C extensions. Also, if you are using Python 2.x, you should be using xrange instead of range.
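For completeness, here is a minimal, hypothetical sketch of what the Cython route looks like; the module name, function name, and simplified loop body are illustrative only. Note that the asker's actual loop makes a grow far beyond any C integer, so typed C variables only pay off when the values fit machine types.
# fastloop.pyx -- hypothetical sketch; typed C variables let Cython
# compile this loop down to plain C with no Python object overhead.
def accumulate(long long a, int n):
    cdef int i
    for i in range(n):
        a += a % 4 + i   # simplified body; values must fit a C integer
    return a
Build it in place with cythonize -i fastloop.pyx, then from fastloop import accumulate.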

There is a scientific paper on the effects of loop unrolling in Python (PDF link), along with the slides of the related talk.
For automatic C code optimization, you can use LLVM in combination with LooPo and possibly Polly. In any case, LLVM is a good starting point.
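Since the question is specifically about loop unrolling, here is a small, self-contained illustration of the technique itself in plain Python: the body handles four elements per iteration, trading code size for less loop overhead. In CPython the win is usually small; in C, the compiler does this for you at higher optimization levels.
def sum_unrolled(data):
    total = 0
    i = 0
    n = len(data) - len(data) % 4
    while i < n:                      # main loop, unrolled four at a time
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
        i += 4
    for j in range(n, len(data)):     # remainder loop for the leftover elements
        total += data[j]
    return total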


Converting SimpleCluster1D pseudo code to python

I am referring to the dissertation by Marcel R. Ackermann, found at https://d-nb.info/100345531X/34 . In the dissertation, Marcel gives pseudo-code for an optimal 1-dimensional k-median algorithm:
[image: pseudo-code for optimal k-median]
I tried to convert the code into Python, as shown below:
import math
import statistics

def cost(arr, median):
    cost = 0
    for i in range(len(arr)):
        cost = cost + abs(arr[i] - median)
    return cost

def simpleCluster1D(arr, k):
    n = len(arr)
    B = [[0] * k for i in range(n)]
    C = [[0] * k for i in range(n)]
    for i in range(k):
        c = statistics.median(arr[:i+1])
        B[i][0] = cost(arr[:i+1], c)
        C[i][0] = c
    for j in range(1, k):
        for i in range(j, n):
            B[i][j] = math.inf
            C[i][j] = []
            for t in range(j, i+1):
                c = statistics.median(arr[t:i+1])
                b = B[t-1][j-1] + cost(arr[t:i+1], c)
                if b < B[i][j]:
                    B[i][j] = b
                    tmp = C[t-1][j-1]
                    C[i][j] = [C[t-1][j-1]] + [c]
    return C[n-1][k-1]
However, the results I obtained are not intuitive.
For example, when
arr = [50, 60, 70, 80]
k = 2
simpleCluster1D(arr, k)
the result is [0, 80], which is wrong. The answer should be [55, 75] or [50, 70].
I don't know where I have gone wrong.
I am wondering if anyone can help me with this conversion? I am also a little confused by the declaration of the array C: column 1 contains the median, while column 2 contains a list in each array index. How do I do that?
Also, do the libraries/packages available online for R/Python (e.g. flexclust in R and pyclustering in Python) already have a built-in optimal 1-D solver? I know that for d > 1 it is impossible to achieve an optimal result, and thus heuristics are used to obtain a locally optimal solution. That is why I concluded that these libraries also solve 1-D problems with heuristics, and hence the answer is not deterministic. Am I right to come to that conclusion?
I don't know where I have gone wrong.
You haven't. The error is in the dissertation; the line
1: for i = 1,2,...,k do
has to be
1: for i = 1,2,...,n do
otherwise rows k+1 through n of the arrays B and C aren't fully initialized.
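Applied to the Python translation above, the fix is to change the bound of the first initialization loop from k to n:
for i in range(n):    # was: for i in range(k)
    c = statistics.median(arr[:i+1])
    B[i][0] = cost(arr[:i+1], c)
    C[i][0] = c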

Cupy is slower than numpy

I tried to speed up my Python code with CuPy instead of NumPy. The problem is that with CuPy, my code got drastically slower. Maybe I approached the problem a little too naively.
Maybe someone can find a bottleneck in my code:
import cupy as np
import time as ti

def f(y, t):
    y_ = np.zeros(2 * N_1*N_2)  # n: e-6, c: e-5
    for i in range(0, N_1*N_2):
        y_[i] = y[i + N_1*N_2]  # n: e-7, c: e-5 or e-6
    for i in range(N_1*N_2):
        sum = -4*y[i]  # n: e-7, c: e-7, after some statements e-5
        if (i + 1 in indexes) and (not (i in indi)):
            sum += y[i+1]  # n: e-7, c: e-7, after some statements e-5
        if (i - 1) in indexes and (i % N_1 != 0):
            sum += y[i-1]  # n: e-7, c: e-7, after some statements e-5
        if i + N_1 in indexes:
            sum += y[i+N_1]  # n: e-7, c: e-7, after some statements e-5
        if i - N_1 in indexes:
            sum += y[i-N_1]  # n: e-7, c: e-7, after some statements e-5
        y_[i + N_1*N_2] = sum
    return y_

def k_1(y, t, h):
    return np.asarray(f(y, t)) * h

def k_2(y, t, h):
    return np.asarray(f(np.add(np.asarray(y), np.multiply(1/2, k_1(y, t, h))), t + 1/2 * h)) * h

# k_3, k_4 look just like k_2, maybe with a 1/2 here or there
# some init stuff is happening here
while t < T_end:
    # also some magic happening here which is just data saving
    y = np.asarray(y) + 1/6*(k_1(y, t, m) + 2*k_2(y, t, m) + 2*k_3(y, t, m) + k_4(y, t, m))
    t += m
EDIT
I tried to benchmark my code; the results can be seen as comments in the code above. Each number stands for one line, the units are seconds; n: NumPy, c: CuPy. I mostly give a rough estimate of the order of magnitude.
Additionally, I tested
np.multiply  # n: e-6, c: e-5
and
np.add  # n: e-5 or e-6, c: 0.005 or e-5
Your code is not slow because numpy is slow, but because you call many (Python) functions, and calling functions (and iterating, and accessing objects, and basically everything) is slow in Python. Thus CuPy will not help you (it will probably even harm performance, because it has to do more setup, e.g. copying the data over to the GPU). If you can formulate your algorithm to use fewer Python functions (vectorizing, as in the other answer), this will speed up your code tremendously, and you probably will not need CuPy at all.
You could also look into numba, which compiles your code to native code with LLVM. If you do, be sure to read some documentation and use nopython=True; otherwise you will only swap slow CuPy code for slow numba code.
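As a rough illustration (not the asker's full code), a hot loop like the copy at the top of f could be compiled this way; shift_halves is a hypothetical name:
import numpy as np
from numba import njit

@njit  # shorthand for @jit(nopython=True): compile natively or fail loudly
def shift_halves(y, n):
    # same copy loop as at the top of f, but compiled via LLVM
    y_ = np.zeros(2 * n)
    for i in range(n):
        y_[i] = y[i + n]
    return y_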
Your code example doesn't work, since you haven't defined N_1, N_2, indexes and indi anywhere. The comments in the code also don't do much to help others understand what's going on.
Your code probably won't benefit from numba/CuPy as it stands, since you haven't vectorized the operations. Plain lists would probably be just as fast as numpy arrays the way your code currently works.
If you get rid of your for loops and change
y_ = np.zeros(2 * N_1*N_2)
for i in range(0, N_1*N_2):
    y_[i] = y[i + N_1*N_2]
to
n = N_1 * N_2
y_ = np.zeros(2 * n)
y_[:n] = y[n:2*n]
and so forth, you will speed your code up substantially.
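Since N_1, N_2, indexes and indi aren't defined in the question, the following is only a hedged sketch of what a fully vectorized f could look like, under the assumption that the index tests implement a 5-point Laplacian stencil on a grid with rows of length N_1 and zero boundaries:
import numpy as np

def f_vectorized(y, t, N_1, N_2):
    n = N_1 * N_2
    u = y[:n].reshape(N_2, N_1)   # assumed grid layout: rows of length N_1
    lap = -4.0 * u
    lap[:, :-1] += u[:, 1:]       # right neighbour (i + 1)
    lap[:, 1:]  += u[:, :-1]      # left neighbour  (i - 1)
    lap[:-1, :] += u[1:, :]       # neighbour below (i + N_1)
    lap[1:, :]  += u[:-1, :]      # neighbour above (i - N_1)
    y_ = np.empty_like(y)
    y_[:n] = y[n:]                # replaces the first loop of the original f
    y_[n:] = lap.ravel()          # replaces the second loop of the original f
    return y_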

How to optimize short string lexing in Python for speed

I'm trying to lex (i.e., tokenize) escaped strings quickly in pure CPython (without resorting to C code).
The best I have been able to come up with is the following:
def bench(s, c, i, n):
    m = 0
    iteration = 0
    while iteration < n:
        # How do I optimize this part?
        # Inputs: string s, index i
        k = i
        while True:
            j = s.index(c, k, n)
            sub = s[k:j]
            if '\\' not in sub: break
            k += sub.index('\\') + 2
        # Outputs: substring s[i:j], index j
        m += j - i
        iteration += 1
    return m

def test():
    from time import clock
    start = clock()
    s = 'sd;fa;sldkfjas;kdfj;askjdf;askjd;fasdjkfa, "abcdefg", asdfasdfas;dfasdl;fjas;dfjk'
    m = bench(s, '"', s.index('"') + 1, 3000000)
    print "%.0f chars/usec" % (m / (clock() - start) / 1000000,)

test()
However, it's still somewhat slow for my taste. The invocation of .index seems to be taking a lot of time in my actual project, though that doesn't show up quite as often in this benchmark.
Most strings that it needs to lex can be assumed to be relatively short (say, 7 characters) and are unlikely to contain backslashes; I've already optimized for that somewhat. My question is:
Are there any optimizations I could make to speed up this code? If so, what?
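One classic CPython micro-optimization that fits this code is binding the hot method to a local name, so the attribute lookup on s.index is paid once instead of once per call; this is only a hedged sketch (bench_local is a hypothetical name), not a measured result:
def bench_local(s, c, i, n):
    m = 0
    iteration = 0
    s_index = s.index          # bind once; skips the attribute lookup per call
    while iteration < n:
        k = i
        while True:
            j = s_index(c, k, n)
            sub = s[k:j]
            if '\\' not in sub:
                break
            k += sub.index('\\') + 2
        m += j - i
        iteration += 1
    return m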

How do I add one more loop to this?

Part of my code is:
list1 = zeros((x, y))
for j in range(1, y):
    for i in range(1, x-1):
        list1[i, j] = list1[i, j-1] + Equation
This works fine. However, for the next stage I need to modify the "Equation" part in the inner for loop. Say the equation is (a*b+c)*d; I wish to make one of the parameters (a, b, c, d) vary with every increase in j.
That is, when j is 1, a = something; when j increases to 2, a changes accordingly. In effect, a is a function of j, for example a = A*cos(w*j).
My problem is: how do I build this relation into the code so that a is updated every time?
Just add an expression in the outer loop, calculating a based on the changing value of j:
for j in range(1, y):
    a = A * cos(w * j)
    for i in range(1, x-1):
        list1[i, j] = list1[i, j-1] + (a * b + c) * d
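If list1 is a NumPy array (as the zeros call suggests), all the a values can also be precomputed in one vectorized call before the loops; this is just a hedged variation, with A, w, b, c, d assumed to be defined as in the question:
import numpy as np

a_vals = A * np.cos(w * np.arange(y))   # a as a function of j, computed all at once
for j in range(1, y):
    a = a_vals[j]
    for i in range(1, x-1):
        list1[i, j] = list1[i, j-1] + (a * b + c) * d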

Why is this algorithm worse?

This is one of the algorithms for generating prime numbers given on Wikipedia:
def eratosthenes_sieve(n):
    # Create a candidate list within which non-primes will be
    # marked as None; only candidates below sqrt(n) need be checked.
    candidates = [i for i in range(n + 1)]
    fin = int(n ** 0.5)
    # Loop over the candidates, marking out each multiple.
    for i in range(2, fin + 1):
        if not candidates[i]:
            continue
        candidates[i + i::i] = [None] * (n // i - 1)
    # Filter out non-primes and return the list.
    return [i for i in candidates[2:] if i]
I changed the algorithm slightly.
def eratosthenes_sieve(n):
    # Create a candidate list within which non-primes will be
    # marked as None; only candidates below sqrt(n) need be checked.
    candidates = [i for i in range(n + 1)]
    fin = int(n ** 0.5)
    # Loop over the candidates, marking out each multiple.
    candidates[4::2] = [None] * (n // 2 - 1)
    for i in range(3, fin + 1, 2):
        if not candidates[i]:
            continue
        candidates[i + i::i] = [None] * (n // i - 1)
    # Filter out non-primes and return the list.
    return [i for i in candidates[2:] if i]
I first marked off all the multiples of 2, and then considered odd numbers only. When I timed both algorithms (with n = 40,000,000), the first one was always faster (albeit very slightly). I don't understand why. Can somebody please explain?
P.S.: When I try 100,000,000, my computer freezes. Why is that? I have a Core Duo E8500, 4 GB RAM, Windows 7 Pro 64-bit.
Update 1: This is Python 3.
Update 2: This is how I timed:
start = time.time()
a = eratosthenes_sieve(40000000)
end = time.time()
print(end - start)
UPDATE: Following the valuable comments (especially by nightcracker and Winston Ewert), I managed to code what I intended in the first place:
def eratosthenes_sieve(n):
    # Create a candidate list within which non-primes will be
    # marked as None; only c below sqrt(n) need be checked.
    c = [i for i in range(3, n + 1, 2)]
    fin = int(n ** 0.5) // 2
    # Loop over the c, marking out each multiple.
    for i in range(fin):
        if not c[i]:
            continue
        c[c[i] + i::c[i]] = [None] * ((n // c[i]) - (n // (2 * c[i])) - 1)
    # Filter out non-primes and return the list.
    return [2] + [i for i in c if i]
This algorithm improves the original algorithm (mentioned at the top) by (usually) 50%. (Still, worse than the algorithm mentioned by nightcracker, naturally).
A question to Python Masters: Is there a more Pythonic way to express this last code, in a more "functional" way?
UPDATE 2: I still couldn't decode the algorithm mentioned by nightcracker. I guess I'm too stupid.
The question is, why would it even be faster? In both examples you are filtering multiples of two, the hard way. It doesn't matter whether you hardcode candidates[4::2] = [None] * (n // 2 - 1) or whether it gets executed in the first iteration of for i in range(2, fin + 1):.
If you are interested in an optimized sieve of Eratosthenes, here you go:
def primesbelow(N):
    # https://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    # """ Input N>=6, Returns a list of primes, 2 <= p < N """
    correction = N % 6 > 1
    N = (N, N-1, N+4, N+3, N+2, N+1)[N%6]
    sieve = [True] * (N // 3)
    sieve[0] = False
    for i in range(int(N ** .5) // 3 + 1):
        if sieve[i]:
            k = (3 * i + 1) | 1
            sieve[k*k // 3::2*k] = [False] * ((N//6 - (k*k)//6 - 1)//k + 1)
            sieve[(k*k + 4*k - 2*k*(i%2)) // 3::2*k] = [False] * ((N // 6 - (k*k + 4*k - 2*k*(i%2))//6 - 1) // k + 1)
    return [2, 3] + [(3 * i + 1) | 1 for i in range(1, N//3 - correction) if sieve[i]]
Explanation here: Porting optimized Sieve of Eratosthenes from Python to C++
The original source is here, but there was no explanation. In short, this prime sieve skips multiples of 2 and 3 and uses a few hacks to exploit Python's fast slice assignment.
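A quick sanity check of primesbelow:
>>> primesbelow(30)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]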
You do not save a lot of time avoiding the evens. Most of the computation time within the algorithm is spent doing this:
candidates[i + i::i] = [None] * (n // i - 1)
That line causes a lot of work on the part of the computer. Whenever the number in question is even, this line is not run, because the loop bails out at the if statement. The time spent running the loop for even numbers is therefore really small, so eliminating those even rounds does not produce a significant change in the timing of the loop. That's why your method isn't considerably faster.
When Python produces numbers for range, it uses the formula start + index * step. Multiplying by two, as in your case, is going to be slightly more expensive than multiplying by one, as in the original.
There is also quite possibly a small overhead to having a longer function.
Neither of those is a really significant speed issue, but together they outweigh the very small benefit your version brings.
It's probably slightly slower because you are performing extra setup to do something that was done in the first case anyway (marking off the multiples of two). That setup time might be what you're seeing, if the difference is as slight as you say.
Your extra step is also unnecessary: it traverses the whole collection of n candidates once for the "get rid of evens" operation, rather than only operating on the numbers up to n^(1/2).
