How to speed up a for loop in Python using Cython

I am trying to make a sensor using a Beaglebone Black (BBB) and Python. I need to get as much data as possible per second from a sensor. The code below allows me to collect about 100,000 data points per second.
import Adafruit_BBIO_GPIO as GPIO
import time

GPIO.setup("P8_13", GPIO.IN)

def get_data(n):
    my_list = []
    start_time = time.time()
    for i in range(n):
        my_list.append(GPIO.input("P8_13"))
    end_time = time.time() - start_time
    print "Time: {}".format(end_time)
    return my_list

n = 100000
get_data(n)
If n = 1,000,000, it takes around 10 seconds to populate my_list, which is the same rate as n = 100,000 in 1 second.
I decided to try Cython to get better results, since I've heard it can significantly speed up Python code. I followed the basic Cython tutorial: I created a data.pyx file with the Python code above, then created a setup.py and, finally, built the Cython extension.
Unfortunately, that did not help at all. So I am wondering whether I am using Cython incorrectly, or whether Cython simply cannot help much in a case like this, where there are no "heavy math computations". Any suggestions on how to speed up my code are highly appreciated!

You can start by adding a static type declaration:
import Adafruit_BBIO_GPIO as GPIO
import time

GPIO.setup("P8_13", GPIO.IN)

def get_data(int n):  # n declared as a C int
    my_list = []
    start_time = time.time()
    for i in range(n):
        my_list.append(GPIO.input("P8_13"))
    end_time = time.time() - start_time
    print "Time: {}".format(end_time)
    return my_list

n = 100000
get_data(n)
This allows the loop itself to be converted into a pure C loop, with the disadvantage that n is no longer arbitrary precision (a value larger than roughly 2 billion won't fit in a C int). That can be mitigated by changing int to unsigned long long, which allows values up to 2**64 - 1, or around 18 quintillion. The unsigned qualifier means you won't be able to pass a negative value.
You'll get a much more substantial speed boost if you can eliminate the list. Try replacing it with an array. Cython can work more efficiently with arrays than with lists.
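For instance, here is a minimal sketch of that idea in the .pyx file, using a typed memoryview over a preallocated NumPy array (my own illustration, untested on a BeagleBone; note that the per-sample GPIO.input call is still a Python-level call, so it remains a bottleneck):
# data.pyx -- illustrative sketch only; keeps the same import as the code above
import numpy as np
import time
import Adafruit_BBIO_GPIO as GPIO

GPIO.setup("P8_13", GPIO.IN)

def get_data(unsigned long long n):
    # preallocate a typed buffer instead of appending to a Python list
    cdef unsigned char[:] samples = np.empty(n, dtype=np.uint8)
    cdef unsigned long long i
    start_time = time.time()
    for i in range(n):
        samples[i] = GPIO.input("P8_13")
    print("Time: {}".format(time.time() - start_time))
    return np.asarray(samples)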

I tried your same code, but with a different build of Adafruit_BBIO; the one-million count takes only about 3 seconds to run on my Rev C board.
I thought the main change from Rev B to Rev C was that the eMMC was increased from 2 GB to 4 GB.
If you grab the current Adafruit_BBIO, all you have to change in your code above is the first import statement; it should be import Adafruit_BBIO.GPIO as GPIO.
What have you tried next?
Ron

Related

Does this manipulation in python save more time for my highly inefficient program?

I am using the following code unchanged in form but changed in content:
import numpy as np
import matplotlib.pyplot as plt
import random
from random import seed
from random import randint
import math
from math import *
from random import *
import statistics
from statistics import *

n = 1000
T_plot = [0]
X_relm = [0]

class Objs:
    def __init__(self, xIn, yIn, color):
        self.xIn = xIn
        self.yIn = yIn
        self.color = color
    def yfT(self, t):
        return self.yIn*t + self.yIn*t
    def xfT(self, t):
        return self.xIn*t - self.yIn*t

xi = np.random.uniform(0, 1, n)
yi = np.random.uniform(0, 1, n)
O1 = [Objs(xIn=i, yIn=j, color=choice(["Black", "White"])) for i, j in zip(xi, yi)]
X = sorted(O1, key=lambda x: x.xIn)

dt = 1/(2*n)
T = 20
iter = 40000
Black = []
White = []
Xrelm = []

for i in range(1, iter+1):
    t = i*dt
    for j in range(n-1):
        check = X[j].xfT(t) - X[j+1].xfT(t)
        if check < 0:
            X[j], X[j+1] = X[j+1], X[j]
        if check < -10:
            X[j].color, X[j+1].color = X[j+1].color, X[j].color
        if X[j].color == "Black":
            Black.append(X[j].xfT(t))
        else:
            White.append(X[j].xfT(t))
    Xrel = mean(Black) - mean(White)
    Xrelm.append(Xrel)

plot1 = plt.figure(1)
plt.plot(T_plot, Xrelm)
plt.xlabel("time")
plt.ylabel("Relative ")
and it keeps running (I left it for 10 hours) without giving any output for some parameters, I guess simply because the problem is too big. I know my code is not totally faulty (in the sense that it should produce something, even if wrong), because it does give output for fewer time steps and other parameters.
So I am focusing on optimizing my code so that it takes less time to run. This is a routine task for experienced coders, but I am a newbie and am coding only because the simulation will help in my field. So any general insights on how to make one's code faster are appreciated.
Besides that, I want to ask whether defining a function for the inner loop beforehand would save any time.
I don't think it should, since it does the same work, but I'm not sure; maybe it does. If it doesn't, any insights on how to deal with nested loops more efficiently, along with general advice, are appreciated.
(I have shortened the code as far as I could without omitting relevant information.)
There are several issues in your code:
The mean is recomputed from scratch over the ever-growing lists, so the cost of mean(Black)-mean(White) is quadratic in the number of elements.
The mean function itself is not efficient; a basic sum and division is much faster. In fact, a manual mean is about 25-30 times faster on my machine.
The CPython interpreter is very slow, so you should avoid loops as much as possible (OOP code does not help either). If that is not possible and your computation is expensive, consider natively compiled code: tools like PyPy, Numba or Cython, or possibly rewriting a part in C (see the sketch further below).
Note that strings are generally quite slow and there is no reason to use them here. Consider using enumerations instead (i.e. integers).
Here is code fixing the first two points:
dt = 1/(2*n)
T = 20
iter = 40000
Black = []
White = []
Xrelm = []

cur1, cur2 = 0, 0
sum1, sum2 = 0.0, 0.0

for i in range(1, iter+1):
    t = i*dt
    for j in range(n-1):
        check = X[j].xfT(t) - X[j+1].xfT(t)
        if check < 0:
            X[j], X[j+1] = X[j+1], X[j]
        if check < -10:
            X[j].color, X[j+1].color = X[j+1].color, X[j].color
        if X[j].color == "Black":
            Black.append(X[j].xfT(t))
        else:
            White.append(X[j].xfT(t))
    # only sum the newly appended values, then update the running sums and counts
    delta1, delta2 = sum(Black[cur1:]), sum(White[cur2:])
    sum1, sum2 = sum1 + delta1, sum2 + delta2
    cur1, cur2 = len(Black), len(White)
    Xrel = sum1/cur1 - sum2/cur2
    Xrelm.append(Xrel)
Consider resetting Black and White to an empty list if you do not use them later.
This is several hundred times faster: it now takes about 2 minutes, as opposed to an estimated >20 hours for the initial code.
Note that natively compiled code should be at least 10 times faster still, so the execution time should drop to no more than a few dozen seconds.
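For points 3 and 4, here is a rough sketch of what a Numba-compiled version could look like, assuming the object attributes are first pulled out into NumPy arrays and colors are encoded as integers (0 = Black, 1 = White); this is my illustration, not a drop-in replacement for the code above:
import numpy as np
from numba import njit

@njit
def simulate(x_in, y_in, colors, n, dt, iters):
    # colors: 0 = Black, 1 = White; running sums/counts replace the growing lists
    xrelm = np.empty(iters)
    sum_b = 0.0
    sum_w = 0.0
    cnt_b = 0
    cnt_w = 0
    for i in range(1, iters + 1):
        t = i * dt
        for j in range(n - 1):
            check = (x_in[j] - y_in[j]) * t - (x_in[j + 1] - y_in[j + 1]) * t
            if check < 0:
                x_in[j], x_in[j + 1] = x_in[j + 1], x_in[j]
                y_in[j], y_in[j + 1] = y_in[j + 1], y_in[j]
            if check < -10:
                colors[j], colors[j + 1] = colors[j + 1], colors[j]
            val = (x_in[j] - y_in[j]) * t   # X[j].xfT(t) after any swap
            if colors[j] == 0:
                sum_b += val
                cnt_b += 1
            else:
                sum_w += val
                cnt_w += 1
        # assumes both colors have appeared at least once by this point
        xrelm[i - 1] = sum_b / cnt_b - sum_w / cnt_w
    return xrelm
Extracting the arrays from the existing objects might look like:
x = np.array([o.xIn for o in X])
y = np.array([o.yIn for o in X])
c = np.array([0 if o.color == "Black" else 1 for o in X])
Xrelm = list(simulate(x, y, c, n, dt, iter))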
As mentioned in earlier comments, this one is a bit too broad to answer.
To illustrate: your iteration itself doesn't take very long:
import time

start = time.time()
for i in range(10000):
    for j in range(10000):
        pass
end = time.time()
print(end - start)
On my not-so-great machine that takes ~2s to complete.
So the looping portion is only a tiny fraction of your 10h+ run time.
The detail of what you're doing in the loop is the key.
Whilst very basic, the approach I've shown in the code above could be applied to your existing code to work out which bit(s) are the least performant and then raise a new question with some more specific, actionable detail.

Why string comparison is NOT faster than integer comparison in Python?

The difference in C++ is huge, but not in Python. I used similar code in C++, and the result is very different: integer comparison is 20-30 times faster than string comparison.
Here is my example code:
import random, time

rand_nums = []
rand_strs = []
total_num = 1000000
for i in range(total_num):
    randint = random.randint(0, total_num*10)
    randstr = str(randint)
    rand_nums.append(randint)
    rand_strs.append(randstr)

start = time.time()
for i in range(total_num-1):
    b = rand_nums[i+1] > rand_nums[i]
end = time.time()
print("integer compare:", end-start)  # 0.14269232749938965 seconds

start = time.time()
for i in range(total_num-1):
    b = rand_strs[i+1] > rand_strs[i]
end = time.time()  # 0.15730643272399902 seconds
print("string compare:", end-start)
I can't explain why it's so slow in C++, but in Python, the reason is simple from your test code: random strings usually differ in the first byte, so the comparison time for those cases should be pretty much the same.
Also, note that much of your overhead will be in the loop control and list accesses. You'd get a much more accurate measure if you remove those factors by zipping the lists:
for s1, s2 in zip(rand_strs, rand_strs[1:]):
    b = s1 > s2
The difference in C++ is huge, but not in Python.
The time spent in the comparison is minimal compared to the rest of the loop in Python. The actual comparison operation is implemented in C inside the interpreter, while the loop itself executes through the interpreter.
As a test, you can run this code that performs all the same operations as the string comparison loop, except without the comparison:
start = time.time()
for i in range(total_num-1):
    b = rand_strs[i+1], rand_strs[i]
end = time.time()
print("no compare:", end-start)
The times are pretty close to each other, though for me string comparison is always the slowest of the three loops:
integer compare: 1.2947499752044678
string compare: 1.3821675777435303
no compare: 1.3093421459197998
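For what it's worth, one way to take the loop and list-indexing overhead out of the measurement entirely is timeit on the bare comparison (a sketch; the numbers will of course differ per machine):
import timeit

# time only the comparison itself; timeit's own per-iteration overhead is
# identical for both cases, so the relative difference is what matters
int_cmp = timeit.timeit("a > b", setup="a, b = 654321, 123456", number=10**7)
str_cmp = timeit.timeit("a > b", setup="a, b = '654321', '123456'", number=10**7)
print("int:", int_cmp, "str:", str_cmp)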

Very fast rolling hash in Python?

I'm writing a toy rsync-like tool in Python. Like many similar tools, it will first use a very fast hash as the rolling hash, and then SHA256 once a match has been found (but the latter is off topic here: SHA256, MD5, etc. are too slow to use as a rolling hash).
I'm currently testing various fast hash methods:
import os, random, time

block_size = 1024            # 1 KB blocks
total_size = 10*1024*1024    # 10 MB of random bytes
s = os.urandom(total_size)

t0 = time.time()
for i in range(len(s)-block_size):
    h = hash(s[i:i+block_size])
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
I get: 0.8 MB/s ... so the Python built-in hash(...) function is too slow here.
Which solution would allow a faster hash of at least 10 MB/s on a standard machine?
I tried with
import zlib
...
h = zlib.adler32(s[i:i+block_size])
but it's not much better (1.1 MB/s)
I tried with sum(s[i:i+block_size]) % modulo and it's slow too
Interesting fact: even without any hash function, the loop itself is slow!
t0 = time.time()
for i in range(len(s)-block_size):
    s[i:i+block_size]
I get only 3.0 MB/s! So the simple fact of having a loop that accesses a rolling block of s is already slow.
Instead of reinventing the wheel and writing my own hash, or using custom Rabin-Karp algorithms, what would you suggest, first to speed up this loop, and then as a hash?
Edit: (Partial) solution for the "Interesting fact" slow loop above:
import os, random, time, zlib
from numba import jit

@jit()
def main(s):
    for i in range(len(s)-block_size):
        block = s[i:i+block_size]

total_size = 10*1024*1024    # 10 MB of random bytes
block_size = 1024            # 1 KB blocks
s = os.urandom(total_size)

t0 = time.time()
main(s)
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
With Numba, there is a massive improvement: 40.0 MB/s, but still no hash done here. At least we're not blocked at 3 MB/s.
Instead of reinventing the wheel and writing my own hash, or using custom Rabin-Karp algorithms, what would you suggest, first to speed up this loop, and then as a hash?
It's always great to start with this mentality, but it seems you haven't quite grasped the idea of rolling hashes.
What makes a hash function great for rolling is its ability to reuse the previous computation.
A few hash functions allow a rolling hash to be computed very
quickly—the new hash value is rapidly calculated given only the old
hash value, the old value removed from the window, and the new value
added to the window.
(From the Wikipedia article on rolling hashes.)
It's hard to compare performance across different machines without timeit, but I changed your script to use simple polynomial hashing with a prime modulus (it would be even faster with a Mersenne prime, because the modulo operation could then be done with binary operations):
import os, random, time

block_size = 1024            # 1 KB blocks
total_size = 10*1024*1024    # 10 MB of random bytes
s = os.urandom(total_size)

base = 256
mod = int(1e9) + 7

def extend(previous_mod, byte):
    # append one byte on the right (indexing/iterating bytes yields ints in Python 3)
    return ((previous_mod * base) + byte) % mod

most_significant = pow(base, block_size - 1, mod)

def remove_left(previous_mod, byte):
    # drop the leftmost byte of the window
    return (previous_mod - (most_significant * byte) % mod) % mod

def start_hash(block):
    h = 0
    for b in block:
        h = extend(h, b)
    return h

t0 = time.time()
h = start_hash(s[:block_size])
for i in range(block_size, len(s)):
    h = remove_left(h, s[i - block_size])
    h = extend(h, s[i])
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
Apparently you achieved quite an improvement with Numba, and it may speed up this code as well.
To extract more performance you may want to write a C (or Rust, or another low-level language) function that processes a big slice of the data at a time and returns an array of hashes.
I'm creating an rsync-like tool as well, but since I'm writing it in Rust, performance at this level isn't a concern of mine. Instead, I'm following the tips of the creator of rsync and trying to parallelize everything I can, which is a painful task to do in Python.
what would you suggest, first to speed up this loop, and then as a hash?
Increase the blocksize. The smaller your blocksize, the more Python you'll be executing per byte, and the slower it will be.
Edit: your range has the default step of 1 and you don't multiply i by block_size, so instead of iterating over 10*1024 non-overlapping 1 KB blocks, you're iterating over roughly 10 million (minus 1024) mostly-overlapping blocks.
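For illustration, the non-overlapping variant the edit describes would step the range by block_size (a sketch based on the question's variables):
# ~10*1024 disjoint 1 KB blocks instead of ~10 million overlapping windows
for i in range(0, len(s) - block_size + 1, block_size):
    h = hash(s[i:i + block_size])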
First, your slow loop. As has been mentioned, you are slicing a new block for every byte (minus one block size) in the stream, which is a lot of work for both CPU and memory.
A faster loop would pre-chunk the data into larger reads.
chunksize = 4096  # suggestion

# roll the window over the previous chunk's last block into the new chunk
lastblock = None
for readchunk in read_file_chunks(chunksize):  # read_file_chunks is a placeholder for your file reader
    for i in range(0, len(readchunk), blocksize):
        # slice a block only once
        newblock = readchunk[i:i+blocksize]
        if lastblock:
            for bi in range(len(newblock)):
                outbyte = lastblock[bi]
                inbyte = newblock[bi]
                # update rolling hash with inbyte and outbyte
                # check rolling hash for "hit"
        else:
            pass  # calculate initial weak hash, check for "hit"
        lastblock = newblock
Chunksize should be a multiple of blocksize
Next, you were calculating a "rolling hash" over the entirety of each block in turn, instead of updating the hash byte by byte in "rolling" fashion, which is immensely slower. The loop above forces you to deal with the bytes as they go in and out of the window. Still, my trials show fairly poor throughput (about 3 MiB/s) even with a modest number of arithmetic operations on each byte. Edit: I initially had a zip() and that appeared rather slow; I got more than double the throughput for the loop alone without the zip (current code above).
Python is single-threaded and interpreted. I see one CPU pegged, and that is the bottleneck. To get faster you'll want multiple processes (subprocess), or to break into C, or both. Simply running the math in C would probably be enough, I think. (Haha, "simply".)
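To make the "# update rolling hash with inbyte and outbyte" step above concrete, here is a small sketch of an rsync-style weak checksum and its per-byte update (my own illustration, not part of the original answer):
M = 1 << 16   # modulus for the two running sums

def weak_checksum(block):
    # initial checksum of a full window: s1 is the plain byte sum,
    # s2 weights each byte by its distance from the right edge of the window
    s1 = sum(block) % M
    s2 = sum((len(block) - i) * b for i, b in enumerate(block)) % M
    return s1, s2

def roll(s1, s2, outbyte, inbyte, blocksize):
    # slide the window one byte: drop outbyte on the left, add inbyte on the right
    s1 = (s1 - outbyte + inbyte) % M
    s2 = (s2 - blocksize * outbyte + s1) % M
    return s1, s2

# sanity check against the question's buffer s (Python 3 bytes, so items are ints)
L = 1024
s1, s2 = weak_checksum(s[:L])
s1, s2 = roll(s1, s2, s[0], s[L], L)
assert (s1, s2) == weak_checksum(s[1:L + 1])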

Speed up a Python 3.5 loop to run as fast as Python can

I need to calculate the distance between pairs of xyz points in massive data (100 GB, about 20 trillion points). I am trying to speed up this loop. I created a KDTree, added parallel calculations, and split my array into smaller parts. So I guess all that's left to speed up is this loop. My pure Python calculation took about 10 hours 42 minutes. Adding NumPy reduced the time to 5 hours 34 minutes. Adding Numba sped it up to 4 hours 15 minutes, but it is still not fast enough. I heard that Cython is the fastest way to do Python calculations, but I don't have any experience in C and I don't know how to translate my function into Cython code. How can I get this loop to run faster, using Cython or any other way?
def controller(point_array, las_point_array):
    empty = []
    tree = spatial.cKDTree(point_array, leafsize=1000, copy_data=True)
    empty = __pure_calc(las_point_array, point_array, empty, tree)
    return ptList
#############################################################################################
#autojit
def __pure_calc(las_point_array, point_array, empty, tree):
    for i in las_point_array:
        p = tree.query(i)
        euc_dist = math.sqrt(np.sum((point_array[p[1]]-i)**2))
        ## add one row at a time to the empty list
        empty.append([i[0], i[1], i[2], euc_dist, point_array[p[1]][0], point_array[p[1]][1], point_array[p[1]][2]])
    return empty
I attach sample data for testing:
Sample
Your function builds a list (closestPt) that ends up looking like this:
[
    [i0[0], i0[1], i0[2], distM0],
    [i1[0], i1[1], i1[2], distM1],
    ...
]
The first thing you should do is to preallocate the entire result as a NumPy array (np.empty()), and write into it one row at a time. This will avoid a ton of memory allocations. Then you will note that you can defer the sqrt() to the very end, and run it on the distM column after your loops are all done.
There may be more optimization opportunities if you post a full working test harness with random/sample input data.
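A minimal sketch of that suggestion, assuming the same row layout [x, y, z, dist, nx, ny, nz] as the list built in __pure_calc (the names here are illustrative):
import numpy as np

def pure_calc_prealloc(las_point_array, point_array, tree):
    # preallocate the whole result instead of appending to a Python list
    out = np.empty((len(las_point_array), 7))
    for row, pt in enumerate(las_point_array):
        dist, idx = tree.query(pt)
        nearest = point_array[idx]
        out[row, 0:3] = pt
        out[row, 3] = np.sum((nearest - pt) ** 2)   # store the squared distance for now
        out[row, 4:7] = nearest
    out[:, 3] = np.sqrt(out[:, 3])                  # one vectorized sqrt at the very end
    return out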
The key is to utilize vectorized functions as much as possible, since any call to a pure Python function inside the loop will more or less make the autojit pointless (the bottleneck will be the pure function call).
I noticed that the query function is vectorizable, and so is the Euclidean distance calculation.
I'm not sure what your ptList variable in the controller function is (the example is a bit faulty), but assuming it is the output of your jit function, or close enough to it, you should be able to do something like this:
def controller(point_array, las_point_array):
    tree = spatial.cKDTree(point_array, leafsize=1000, copy_data=True)
    distances, pt_idx = tree.query(las_point_array)
    nearest_pts = point_array[pt_idx]
    # square the differences before summing, then take the root
    euc_distances = np.sqrt(((nearest_pts - las_point_array) ** 2).sum(axis=1))
    result = np.vstack((las_point_array.T, euc_distances.T, nearest_pts.T)).T
    return result

Faster way to access 2D lists/arrays and do calculations in Python?

I am running 2 nested loops (the first one 120 iterations, the second one 500 iterations).
In most of the 120x500 iterations I need to access several lists and lists-in-lists (I will call them 2D arrays).
At the moment the 120x500 iterations take about 4 seconds. Most of the time is taken by the three list appends and several 2D array accesses.
The arrays are prefilled by me outside of the loops.
Here is my code:
#Range from 0 to 119
for cur_angle in range(0, __ar_angular_width-1):
    #Range from 0 to 499
    for cur_length in range(0, int(__ar_length * range_res_scale)-1):
        v_x = (auv_rot_mat_0_0*self.adjacent_dx[cur_angle][cur_length]) + (auv_rot_mat_0_1*self.opposite_dy[cur_angle][cur_length])
        v_y = (auv_rot_mat_1_0*self.adjacent_dx[cur_angle][cur_length]) + (auv_rot_mat_1_1*self.opposite_dy[cur_angle][cur_length])
        v_x_diff = (v_x+auv_trans_x) - ocp_grid_origin_x
        v_y_diff = (v_y+auv_trans_y) - ocp_grid_origin_y
        p_x = m.floor(v_x_diff/ocp_grid_resolution)
        p_y = m.floor(v_y_diff/ocp_grid_resolution)
        data_index = int(p_y * ocp_grid_width + p_x)
        if data_index >= 0 and data_index < (len(ocp_grid.data)-1):
            probability = ocp_grid.data[data_index]
            if probability == 100:
                if not m.isnan(self.v_directions[cur_angle]):
                    magnitude = m.pow(probability, 2) * self.magnitude_ab[cur_length]
                    ov_1 = self.v_directions[cur_angle]
                    ov_2 = magnitude
                    ov_3 = self.distances[cur_length]
                    obstacle_vectors.append(ov_1)
                    obstacle_vectors.append(ov_2)
                    obstacle_vectors.append(ov_3)
I tried to measure the processing times via time.time() and taking differences, but it didn't work reliably; the measured times fluctuated quite a lot.
I am not really a Python pro, so any advice is welcome.
Any ideas on how to make the code faster?
EDIT:
The initialization of the arrays was done with this code:
self.adjacent_dx = [i[:] for i in [[0]*(length_iterations-1)]*(angular_iterations-1)]
self.opposite_dy = [i[:] for i in [[0]*(length_iterations-1)]*(angular_iterations-1)]
Some general tips:
1. Try to optimize your algorithm as best you can.
2. If possible, use PyPy instead of regular Python. This usually doesn't require any code modification as long as all your external dependencies work with it.
3. Static typing and compilation to C can add an additional boost, but require some simple code modifications. For this purpose you can use Cython (see the sketch below).
Note that step 1 is often hard and time-consuming, and gives only a small boost if your code is already good, while steps 2 and 3 can give a dramatic boost without much additional effort.
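As a rough illustration of step 3, the hot loop from the question could be moved into a .pyx module with static types. The sketch below uses assumed parameter names, and the 2D lists would first need to become NumPy arrays (or other buffers that support typed memoryviews):
# grid_lookup.pyx -- illustrative Cython sketch only, not a drop-in replacement
cimport cython
from libc.math cimport floor

@cython.boundscheck(False)
@cython.wraparound(False)
def compute_indices(double[:, :] adjacent_dx, double[:, :] opposite_dy,
                    double r00, double r01, double r10, double r11,
                    double trans_x, double trans_y,
                    double origin_x, double origin_y,
                    double resolution, int grid_width):
    # returns (angle, length, flat grid index) triples for every cell
    cdef Py_ssize_t a, l
    cdef double v_x, v_y, p_x, p_y
    cdef list indices = []
    for a in range(adjacent_dx.shape[0]):
        for l in range(adjacent_dx.shape[1]):
            v_x = r00 * adjacent_dx[a, l] + r01 * opposite_dy[a, l]
            v_y = r10 * adjacent_dx[a, l] + r11 * opposite_dy[a, l]
            p_x = floor((v_x + trans_x - origin_x) / resolution)
            p_y = floor((v_y + trans_y - origin_y) / resolution)
            indices.append((a, l, <int>(p_y * grid_width + p_x)))
    return indices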
