How do I optimize this python code using cython? - python

I have some python code, and I'm wondering what I can do to optimize the speed for creating the array using Cython. Note that I have tried other methods: Counting Algorithm Performance Optimization in Pypy vs Python (Numpy vs List)
It seems like Cython is significantly faster than anything I've tried before right out of the box. I am wondering if I can get even more performance.
#!/usr/bin/env python
def create_array(size=4):
"""
Creates a multi-dimensional array from size
"""
array = [(x, y, z)
for x in xrange(size)
for y in xrange(size)
for z in xrange(size)]
return array
Thanks in advance!

I won't help with the cython-code, but I believe this operation can still be done efficiently in numpy, you just haven't looked deep enough yet.
def variations_with_repetition(alphabetlen):
"""Return a list of all possible sets of len=3 with elements
chosen from range(alphabetlen)."""
a = np.arange(alphabetlen)
z = np.vstack((
np.repeat(a,alphabetlen**2),
np.tile(np.repeat(a,alphabetlen),alphabetlen),
np.tile(a,alphabetlen**2))).T
return z
Now, execution speed here is meaningless in this case because you just mention you want it below 2ms for alphabetlen=32. That depends on your CPU. But I can compare your own proposed method to this one:
In [4]: %timeit array = [(x, y, z) for x in xrange(size) for y in xrange(size) for z in xrange(size)]
100 loops, best of 3: 3.3 ms per loop
In [5]: %timeit variations_with_repetition(32)
1000 loops, best of 3: 348 µs per loop
That's well below your desires 2ms speed. But once again, your mileage may vary depending on the CPU.

Related

Is it possible to improve python performance for this code?

I have a simple code that:
Read a trajectory file that can be seen as a list of 2D arrays (list of positions in space) stored in Y
I then want to compute for each pair (scipy.pdist style) the RMSD
My code works fine:
trajectory = read("test.lammpstrj", index="::")
m = len(trajectory)
#.get_positions() return a 2d numpy array
Y = np.array([snapshot.get_positions() for snapshot in trajectory])
b = [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
This code execute in 0.86 seconds using python3.10, using Julia1.8 the same kind of code execute in 0.46 seconds
I plan to have trajectory much larger (~ 200,000 elements), would it be possible to get a speed-up using python or should I stick to Julia?
You've mentioned that snapshot.get_positions() returns some 2D array, suppose of shape (p, q). So I expect that Y is a 3D array with some shape (m, p, q), where m is the number of snapshots in the trajectory. You also expect m to scale rather high.
Let's see a basic way to speed up the distance calculation, on the setting m=1000:
import numpy as np
# dummy inputs
m = 1000
p, q = 4, 5
Y = np.random.randn(m, p, q)
# your current method
def foo():
return [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
# vectorized approach -> compute the upper triangle of the pairwise distance matrix
def bar():
u, v = np.triu_indices(Y.shape[0], 1)
return np.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# Check for correctness
out_1 = foo()
out_2 = bar()
print(np.allclose(out_1, out_2))
# True
If we test the time required:
%timeit -n 10 -r 3 foo()
# 3.16 s ± 50.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
The first method is really slow, it takes over 3 seconds for this calculation. Let's check the second method:
%timeit -n 10 -r 3 bar()
# 97.5 ms ± 405 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So we have a ~30x speedup here, which would make your large calculation in python much more feasible than using the original code. Feel free to test out with other sizes of Y to see how it scales compared to the original.
JIT
In addition, you can also try out JIT, mainly jax or numba. It is fairly simple to port the function bar with jax.numpy, for example:
import jax
import jax.numpy as jnp
#jax.jit
def jit_bar(Y):
u, v = jnp.triu_indices(Y.shape[0], 1)
return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# check for correctness
print(np.allclose(bar(), jit_bar(Y)))
# True
If we test the time of the jitted jnp op:
%timeit -n 10 -r 3 jit_bar(Y)
# 10.6 ms ± 678 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So compared to the original, we could reach even up to ~300x speed.
Note that not every operation can be converted to jax/jit so easily (this particular problem is conveniently suitable), so the general advice is to simply avoid python loops and use numpy's broadcasting/vectorization capabilities, like in bar().
Stick to Julia.
If you already made it in a language which runs faster, why are you trying to use python in the first place?
Your question is about speeding up Python, relative to Julia, so I'd like to offer some Julia code for comparison.
Since your data is most naturally expressed as a list of 4x5 arrays, I suggest expressing it as a vector of SMatrixes:
sumdiff2(A, B) = sum((A[i] - B[i])^2 for i in eachindex(A, B))
function dists(Y)
M = length(Y)
V = Vector{float(eltype(eltype(Y)))}(undef, sum(1:M-1))
Threads.#threads for i in eachindex(Y)
ii = sum(M-i+1:M-1) # don't worry about this sum
for j in i+1:lastindex(Y)
ind = ii + (j-i)
V[ind] = sqrt(3 * sumdiff2(Y[i], Y[j])/length(Y[i]))
end
end
return V
end
using Random: randn
using StaticArrays: SMatrix
Ys = [randn(SMatrix{4,5,Float64}) for _ in 1:1000];
Benchmarks:
# single-threaded
julia> using BenchmarkTools
julia> #btime dists($Ys);
6.561 ms (2 allocations: 3.81 MiB)
# multi-threaded with 6 cores
julia> #btime dists($Ys);
1.606 ms (75 allocations: 3.82 MiB)
I was not able to install jax on my computer, but when comparing with #Mercury's numpy code I got
foo: 5.5seconds
bar: 179ms
i.e. approximately 3400x speedup over foo.
It is possible to write this as a one-liner at a ~2-3x performance cost.
While Python tends to be slower than Julia for many tasks, it is possible to write numerical codes as fast as Julia in Python using Numba and plain loops. Indeed, Numba is based on LLVM-Lite which is basically a JIT-compiler based on the LLVM toolchain. The standard implementation of Julia also use a JIT and the LLVM toolchain. This means the two should behave pretty closely besides the overhead introduced by the languages that are negligible once the computation is performed in parallel (because the resulting computation will be memory-bound on nearly all modern platforms).
This computation can be parallelized in both Julia and Python (still using Numba). While writing a sequential computation is quite straightforward, writing a parallel computation is if bit more complex. Indeed, computing the upper triangular values can result in an imbalanced workload and so to a sub-optimal execution time. An efficient strategy is to compute, for each iteration, a pair of lines: one comes from the top of the upper triangular part and one comes from the bottom. The top line contains m-i items while the bottom one contains i+1 items. In the end, there is m+1 items to compute per iteration so the number of item is independent of the iteration number. This results in a much better load-balancing. The line of the middle needs to be computed separately regarding the size of the input array.
Here is the final implementation:
import numba as nb
import numpy as np
#nb.njit(inline='always', fastmath=True)
def compute_line(tmp, res, i, m):
offset = (i * (2 * m - i - 1)) // 2
factor = 3.0 / n
for j in range(i + 1, m):
s = 0.0
for k in range(n):
s += (tmp[i, k] - tmp[j, k]) ** 2
res[offset] = np.sqrt(s * factor)
offset += 1
return res
#nb.njit('()', parallel=True, fastmath=True)
def fastest():
m, n = Y.shape[0], Y.shape[1] * Y.shape[2]
res = np.empty(m*(m-1)//2)
tmp = Y.reshape(m, n)
for i in nb.prange(m//2):
compute_line(tmp, res, i, m)
compute_line(tmp, res, m-i-1, m)
if m % 2 == 1:
compute_line(tmp, res, (m+1)//2, m)
return res
# [...] same as others
%timeit -n 100 fastest()
Results
Here are performance results on my machine (with a i5-9600KF having 6 cores):
foo (seq, Python, Mercury): 4910.7 ms
bar (seq, Python, Mercury): 134.2 ms
jit_bar (seq, Python, Mercury): ???
dists (seq, Julia, DNF) 6.9 ms
dists (par, Julia, DNF) 2.2 ms
fastest (par, Python, me): 1.5 ms <-----
(Jax does not work on my machine so I cannot test it yet)
This implementation is the fastest one and succeed to beat the best Julia code so far.
Optimal implementation
Note that for large arrays like (200_000,4,5), all implementations provided so far are inefficient since they are not cache friendly. Indeed, the input array will take 32 MiB and will not for on the cache of most modern processors (and even if it could, one need to consider the space needed for the output and the fact that caches are not perfect). This can be fixed using tiling, at the expense of an even more complex code. I think such an implementation should be optimal if you use Z-order curves.

Optimizing Many Matrix Operations in Python / Numpy

In writing some numerical analysis code, I have bottle-necked at a function that requires many Numpy calls. I am not entirely sure how to approach further performance optimization.
Problem:
The function determines error by calculating the following,
Code:
def foo(B_Mat, A_Mat):
Temp = np.absolute(B_Mat)
Temp /= np.amax(Temp)
return np.sqrt(np.sum(np.absolute(A_Mat - Temp*Temp))) / B_Mat.shape[0]
What would be the best way to squeeze some extra performance out of the code? Would my best course of action be performing the majority of the operations in a single for loop with Cython to cut down on the temporary arrays?
There are specific functions from the implementation that could be off-loaded to numexpr module which is known to be very efficient for arithmetic computations. For our case, specifically we could perform squaring, summation and absolute computations with it. Thus, a numexpr based solution to replace the last step in the original code, would be like so -
import numexpr as ne
out = np.sqrt(ne.evaluate('sum(abs(A_Mat - Temp**2))'))/B_Mat.shape[0]
A further performance boost could be achieved by embedding the normalization step into the numexpr's evaluate expression. Thus, the entire function modified to use numexpr would be -
def numexpr_app1(B_Mat, A_Mat):
Temp = np.absolute(B_Mat)
M = np.amax(Temp)
return np.sqrt(ne.evaluate('sum(abs(A_Mat*M**2-Temp**2))'))/(M*B_Mat.shape[0])
Runtime test -
In [198]: # Random arrays
...: A_Mat = np.random.randn(4000,5000)
...: B_Mat = np.random.randn(4000,5000)
...:
In [199]: np.allclose(foo(B_Mat, A_Mat),numexpr_app1(B_Mat, A_Mat))
Out[199]: True
In [200]: %timeit foo(B_Mat, A_Mat)
1 loops, best of 3: 891 ms per loop
In [201]: %timeit numexpr_app1(B_Mat, A_Mat)
1 loops, best of 3: 400 ms per loop

Can I vectorize this Python code?

I'm kind of new to Python and I have to implement "fast as possible" version of this code.
s="<%dH" % (int(width*height),)
z=struct.unpack(s, contents)
heights = np.zeros((height,width))
for r in range(0,height):
for c in range(0,width):
elevation=z[((width)*r)+c]
if (elevation==65535 or elevation<0 or elevation>20000):
elevation=0.0
heights[r][c]=float(elevation)
I've read some of the python vectorization questions... but I don't think it applies to my case. Most of the questions are things like using np.sum instead of for loops. I guess I have two questions:
Is it possible to speed up this code...I think heights[r][c]=float(elevation) is where the bottleneck is. I need to find some Python timing commands to confirm this.
If it possible to speed up this code. What are my options? I have seen some people recommend cython, pypy, weave. I could do this faster in C but this code also need to generate plots so I'd like to stick with Python so I can use matplotlib.
As you mention, the key to writing fast code with numpy involves vectorization, and pushing the work off to fast C-level routines instead of Python loops. The usual approach seems to improve things by a factor of ten or so relative to your original code:
def faster(elevation, height, width):
heights = np.array(elevation, dtype=float)
heights = heights.reshape((height, width))
heights[(heights < 0) | (heights > 20000)] = 0
return heights
>>> h,w = 100, 101; z = list(range(h*w))
>>> %timeit orig(z,h,w)
100 loops, best of 3: 9.71 ms per loop
>>> %timeit faster(z,h,w)
1000 loops, best of 3: 641 µs per loop
>>> np.allclose(orig(z,h,w), faster(z,h,w))
True
That ratio seems to hold even for longer z:
>>> h,w = 1000, 10001; z = list(range(h*w))
>>> %timeit orig(z,h,w)
1 loops, best of 3: 9.44 s per loop
>>> %timeit faster(z,h,w)
1 loops, best of 3: 675 ms per loop

NumPy vs Cython - nested loop so slow?

I am confused how NumPy nested loop for 3D array is so slow in comparison with Cython.
I wrote trivial example.
Python/NumPy version:
import numpy as np
def my_func(a,b,c):
s=0
for z in xrange(401):
for y in xrange(401):
for x in xrange(401):
if a[z,y,x] == 0 and b[x,y,z] >= 0:
c[z,y,x] = 1
b[z,y,x] = z*y*x
s+=1
return s
a = np.zeros((401,401,401), dtype=np.float32)
b = np.zeros((401,401,401), dtype=np.uint32)
c = np.zeros((401,401,401), dtype=np.uint8)
s = my_func(a,b,c)
Cythonized version:
cimport numpy as np
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
def my_func(np.float32_t[:,:,::1] a, np.uint32_t[:,:,::1] b, np.uint8_t[:,:,::1] c):
cdef np.uint16_t z,y,x
cdef np.uint32_t s = 0
for z in range(401):
for y in range(401):
for x in range(401):
if a[z,y,x] == 0 and b[x,y,z] >= 0:
c[z,y,x] = 1
b[z,y,x] = z*y*x
s = s+1
return s
Cythonized version of my_func() runs approx. 6500x faster. Simpler function only with if-statement and array access can be even 10000x faster. Python version of my_func() takes 500.651 sec. to finish. Is iterating over relatively small 3D array so slow or I made some mistake in code?
Cython version 0.21.1, Python 2.7.5, GCC 4.8.1, Xubuntu 13.10.
Python is an interpreted language. One of the benefits of compiling to machine code is the huge speedup you get, especially with things like nested loops.
I don't know what your expectations are, but all interpreted languages will be terribly slow at the things you are trying to do (JIT compiling may help to some extent though).
The trick of getting good performance out of Numpy (or MATLAB or anything similar) is to avoid looping altogether and instead try to refactor your code into a few operations on large matrices. This way, the looping will take place in the (heavily optimized) machine code libraries instead of your Python code.
As mentioned by Krumelur, python loops are definitely slow. You can, however, use numpy to your advantage. Operations on entire arrays are quite fast, although you need a little ingenuity sometimes.
For instance, in your code, since your loop never reads the value in b after you modify it (I think? My head is a little fuzzy at the moment, so you'll definitely want to go through this), the following should be equivalent:
# Precalculate a matrix of x*y*z
tmp = np.indices(a.shape)
prod = (tmp[:,:,:,0] * tmp[:,:,:,1] * tmp[:,:,:,2]).T
# Use array-wide logical operations to compute c using a and the transpose of b
condition = np.logical_and(a == 0, b.T >= 0)
# Use condition to alter b and c only where condition is true
b[condition] = prod[condition]
c[condition] = 1
s = condition.sum()
So this does calculate x*y*z even in cases where the condition is false. You could probably avoid that if it turns out that is using lots of time, but it's likely not to be a significant factor.
For loop with numpy array in python is slow, you should use vector calculation as possible. If the algorithm need for loop for every elements in the array, here is some speedup hint.
a[z,y,x] is a numpy scalar value, calculation with numpy scalar values is very slow:
x = 3.0
%timeit x > 0
x = np.float64(3.0)
%timeit x > 0
the output on my pc with numpy 1.8.2, windows 7:
10000000 loops, best of 3: 64.3 ns per loop
1000000 loops, best of 3: 657 ns per loop
you can use item() method to get the python value directly:
if a.item(z, y, x) == 0 and b.item(x, y, z) >= 0:
...
it can speedup the for loop about 8x.

sign() much slower in python than matlab?

I have a function in python that basically takes the sign of an array (75,150), for example.
I'm coming from Matlab and the time execution looks more or less the same less this function.
I'm wondering if sign() works very slowly and you know an alternative to do the same.
Thx,
I can't tell you if this is faster or slower than Matlab, since I have no idea what numbers you're seeing there (you provided no quantitative data at all). However, as far as alternatives go:
import numpy as np
a = np.random.randn(75, 150)
aSign = np.sign(a)
Testing using %timeit in IPython:
In [15]: %timeit np.sign(a)
10000 loops, best of 3: 180 µs per loop
Because the loop over the array (and what happens inside it) is implemented in optimized C code rather than generic Python code, it tends to be about an order of magnitude faster—in the same ballpark as Matlab.
Comparing the exact same code as a numpy vectorized operation vs. a Python loop:
In [276]: %timeit [np.sign(x) for x in a]
1000 loops, best of 3: 276 us per loop
In [277]: %timeit np.sign(a)
10000 loops, best of 3: 63.1 us per loop
So, only 4x as fast here. (But then a is pretty small here.)

Categories