1. Consider the following traversal of a numpy.ndarray
for ii in xrange(0,(nxTes-2)):
    if ( (xCom-dtaCri-xcTes[ii]) * (xCom-dtaCri-xcTes[ii+1]) ) <= 0.0:
        nxL = ii
    if ( (xCom+dtaCri-xcTes[ii]) * (xCom+dtaCri-xcTes[ii+1]) ) <= 0.0:
        nxR = ii+1
2. xCom, dtaCri and xcTes are of types numpy.float64, float and numpy.ndarray, respectively (as reported by type()).
3. The full block above is repeated for nyTes and nzTes, i.e. a total of three such blocks run in the main algorithm loop. The goal is to create a region of interest with window size dtaCri, centred at the comparison point xCom, using positional data from xcTes.
4. The code above is more or less a straight port from Matlab, where the same block executes roughly three to four times faster.
5. Question: Is it possible to optimize the block above with respect to execution time, and if so, how?
6. So far I have tried some minor tweaks such as altering data types and using range() instead of xrange(), none of which produced any noticeable change in performance.
Pre-compute those boolean conditionals in a vectorized manner before going into the loop, making use of slicing (slices are just views into the input array), like so -
parte1 = ( (xCom-dtaCri-xcTes[:nxTes-2]) * (xCom-dtaCri-xcTes[1:nxTes-1]) ) <=0.0
parte2 = ( (xCom+dtaCri-xcTes[:nxTes-2]) * (xCom+dtaCri-xcTes[1:nxTes-1]) ) <=0.0
We can see that some computations are repeated, so we can reuse them -
p = xCom-xcTes[:nxTes-1]
p0 = p - dtaCri
p1 = p + dtaCri
parte1 = p0[:-1]*p0[1:] <= 0.0
parte2 = p1[:-1]*p1[1:] <= 0.0
Then, just use those bools in the loop -
for ii in xrange(0,(nxTes-2)):
    if parte1[ii]:
        nxL = ii
    if parte2[ii]:
        nxR = ii+1
The idea is to do minimal work inside the loop with focus on performance.
I am assuming you have more work going on inside the loop that uses nxL and nxR, because otherwise we are simply overwriting those two variables on every match.
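If nothing else happens inside the loop and only the last matching indices are needed, the loop can be dropped entirely; a small sketch of that (assuming parte1/parte2 from above, and that the caller handles the no-match case) -
import numpy as np

idx1 = np.flatnonzero(parte1)    # indices where the first condition holds
idx2 = np.flatnonzero(parte2)    # indices where the second condition holds
if idx1.size:
    nxL = idx1[-1]               # last match, the value the original loop leaves behind
if idx2.size:
    nxR = idx2[-1] + 1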
Related
I was looking at the code for a particular cryptocurrency casino game (EthCrash - if you're interested). The game generates crash points using a function (I call this crash(x)) where x is an integer that is randomly drawn from the space of integers (0,2^52).
I'd like to calculate the expected value of the crash points. The code below should explain everything, but a clean picture of the function is here: https://i.imgur.com/8dPBALa.png, and what I'm trying to calculate is here: https://i.imgur.com/nllykDQ.png (apologies - can't paste pictures yet).
I wrote the following code:
import math
two52 = 2**52
def crash(x):
    crash_point = math.floor((100*two52 - x)/(two52 - x))
    return crash_point/100
crashes_sum = 0
for i in range(two52+1):
    crashes_sum += crash(i)
expected_crash = crashes_sum/two52
Unfortunately, the loop is taking too long to run - any ideas for how I can do this faster?
OK, if you cannot do it the straightforward way, it is time to get smart, right?
The idea is to find ranges over which whole chunks of the sum can be computed quickly. I will put down some pseudocode that does not even compile and could have bugs; use it as an illustration only.
First, let's rewrite the term in the sum as
floor( 100 + 99*x/(2**52 - x) )
First idea: find ranges where the floor is not changing, i.e. where
n <= 99*x/(2**52 - x) < n+1. Over such a whole range we can add range_length*(100 + n) to the sum, with no need to do it term by term.
sum = 0
r_lo = 0
for n in range(1, 2**52):        # LOOP OVER RANGES
    r_hi = floor(2**52/(1 + 99/n))
    sum += (100 + n - 1)*(r_hi - r_lo)
    if r_hi - r_lo == 1:
        break
    r_lo = r_hi + 1
The range size will shrink until it is equal to 1, at which point this method becomes useless and we break out. By that time each term differs from the previous one by 1 or more.
OK, second idea: again ranges, but now ranges over which the sum is an arithmetic series. First we have to find the range where the increment between consecutive terms is equal to 1, then the range where the increment is equal to 2, and so on. It looks like you have to find the roots of a quadratic equation for this, but the code would be about the same.
r_lo = pos_for_increment(1)
t_lo = ...                       # term at r_lo
for n in range(2, 2**52):        # LOOP OVER RANGES
    r_hi = pos_for_increment(n) - 1
    t_hi = ...                   # term at r_hi
    sum += (t_lo + t_hi)*(r_hi - r_lo) / 2   # arithmetic series sum
    if r_hi > 2**52:
        break
    r_lo = r_hi + 1
    t_lo = t_hi + n
I might think of something else later, but those tricks are worth trying.
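To make idea #1 concrete, here is a runnable sketch of the constant-floor-range trick (my own rearrangement, not part of the answer above): substituting y = 2**52 - x turns each term into 1 + (99*2**52)//y, and the sum of those floor divisions can be accumulated over blocks of y on which the quotient is constant. That reduces the work from ~4.5e15 terms to on the order of 1e9 blocks, still a long pure-Python loop but a feasible one -
two52 = 2**52
M = 99*two52
def floor_div_sum(M, N):
    # sum of M // y for y = 1..N, accumulated over blocks where M // y is constant
    total, y = 0, 1
    while y <= N:
        v = M // y
        y_hi = min(M // v, N)          # last y that still gives the same quotient v
        total += v*(y_hi - y + 1)
        y = y_hi + 1
    return total
# floor((100*two52 - x)/(two52 - x)) == 1 + M//(two52 - x) for x = 0..two52-1
crashes_sum = (two52 + floor_div_sum(M, two52))/100
expected_crash = crashes_sum/two52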
Using the map function might help increase the speed a little, since it pushes the loop overhead into C; note, though, that the built-in map still evaluates sequentially, not in parallel.
import math
two52 = 2**52
def crash(x):
    crash_point = math.floor((100*two52 - x)/(two52 - x))
    return crash_point/100
crashes_sum = sum(map(crash, range(two52)))
expected_crash = crashes_sum/two52
I have been able to speed up your code by taking advantage of numpy vectorization:
import numpy as np
import time
two52 = 2**52
crash = lambda x: np.floor( ( 100 * two52 - x ) / ( two52 - x ) ) / 100
starttime = time.time()
icur = 0
ispan = 100000
crashes_sum = 0
while icur < two52 - ispan:
    i = np.arange(icur, icur + ispan, 1)
    crashes_sum += np.sum(crash(i))
    icur += ispan
crashes_sum += np.sum(crash(np.arange(icur, two52, 1)))
expected_crash = crashes_sum / two52
print(time.time() - starttime)
The trick is to compute the sum over a moving window to take advantage of numpy's vectorization (written in C). I tried up to 2**30 and it takes 9 seconds on my laptop (your original code took too long to benchmark against).
Python is probably not the most suitable language for what you want to do; you may want to try C or Fortran for that (and take advantage of threading).
You will have to use a powerful GPU if you want the result within a few hours.
A possible CPU implementation
import numpy as np
import numba as nb
import time
two52 = 2**52
loop_to=2**30
@nb.njit(fastmath=True, parallel=True)
def sum_over_crash(two52, loop_to):    # loop_to is only for testing performance
    crashes_sum = nb.float64(0)
    for i in nb.prange(loop_to):       # use nb.prange(two52+1) for the full calculation
        crashes_sum += np.floor((100*two52 - i)/(two52 - i))/100
    return crashes_sum/two52
sum_over_crash(two52,2)#don't measure static compilation overhead
t1=time.time()
sum_over_crash(two52,2**30)
print(time.time()-t1)
This takes 0.57 s on my quad-core i7, i.e. about 28 days for the whole calculation.
If the calculation cannot be simplified mathematically, the only option is to calculate it term by term.
This takes a long time (as stated in other answers). Your best bet for calculating it quickly is to use a lower-level language than Python; as an interpreted language, Python is rather slow at this kind of tight numerical loop.
Additionally, you can use multithreading (if available in the chosen language) to make it even faster.
Cloud Computing is also an option that could be suitable for this, as you are only going to calculate the number once. Amazon and Google (and many more) provide this kind of service for a relatively small fee.
But before performing any of the calculations you need to adjust your loop, because as it stands you will get a ZeroDivisionError at the very last iteration (x == 2**52).
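For instance, the simplest adjustment (assuming, as the game's stated range (0, 2^52) suggests, that x == 2**52 is never actually drawn) is to stop the loop one step short -
for i in range(two52):    # excludes i == two52, which would divide by zero
    crashes_sum += crash(i)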
I have an image of the sun; I found its center and radius and now I want to process pixels differently depending on whether they are inside or outside the disk. The ideal solution would be to interpolate the parameters of the processing function, in order to transition smoothly from disk to background.
Here is what I'm doing now:
for index, value in np.ndenumerate(sun_img):
    if distance.euclidean(index, center) > radius:
        sun_img[index] = processing_function(index, value)
This works, but it takes forever to compute the image. I'm sure there is a more efficient way to do it. How would you solve this?
Image shape is around (1000, 1000)
Processing_function is basically not doing anything right now: value += 1
The function should be something like a non-linear "step function", with value 0.0 up to the radius and 1.0 from about 5 px beyond it, something like _______/''''''''''''''''''''' multiplied by the value of the pixel. The ramp should be centred on the radius. I want to do this in order to enhance the prominences.
Here's a vectorized way leveraging NumPy broadcasting -
m,n = sun_img.shape
I,J = np.ogrid[:m,:n]
sq_dist = (I - center[0])**2 + (J - center[1])**2
valid_mask = sq_dist > radius**2
Now, for a processing_function that just adds 1 to the valid places, defined by the IF-conditional, do -
sun_img[valid_mask] += 1
If you need to implement a custom operation with processing_function that needs those row, column indices, use np.where to get those indices and then iterate through the valid elements, like so -
r, c = np.where(valid_mask)
for index in zip(r, c):
    sun_img[index] = processing_function(index, sun_img[index])
If you have a lot of such valid places, then computing r,c might make things slow. In that case, directly use the mask, like so -
for index, value in np.ndenumerate(sun_img):
    if valid_mask[index]:
        sun_img[index] = processing_function(index, value)
Compared to the original code, the benefit is that we have the conditional values pre-computed before going into the loop. The best way again would be to vectorize processing_function itself so that it works on a bigger chunk of data, but that would depend on its implementation.
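As a rough sketch of the smooth transition described in the question (a linear ramp from 0 at the radius to 1 about 5 px outside it; the 5 px width and the whole-image scaling are assumptions) -
dist = np.sqrt(sq_dist)                           # per-pixel distance to the center
ramp = np.clip((dist - radius)/5.0, 0.0, 1.0)     # 0 inside the disk, 1 from radius+5 px outward
sun_img = sun_img*ramp                            # scale each pixel by the ramp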
I'm still an amateur when it comes to thinking about how to optimize. I have a section of code that takes in a list of found peaks and finds where these peaks, +/- some value, are located in a multidimensional array. It then adds 1 at those indices in a zeros array. The code works well, but it takes a long time to execute. For instance, it takes close to 45 min to run when ind has 270 values and refVals has a shape of (3050, 3130, 80). I understand that it is a lot of data to churn through, but is there a more efficient way of going about this?
maskData = np.zeros_like(refVals).astype(np.int16)
for peak in ind:
    tmpArr = np.ma.masked_outside(refVals, x[peak]-2, x[peak]+2).astype(np.int16)
    maskData[tmpArr.mask == False] += 1
    tmpArr = None
maskData = np.sum(maskData, axis=2)
Approach #1 : Memory permitting, here's a vectorized approach using broadcasting -
# Create the -2/+2 limits using ind
r = x[ind[:,None]] + [-2,2]
# Use the limits to get in-range matches, then sum over the peak and last dims
mask = (refVals >= r[:,None,None,None,0]) & (refVals <= r[:,None,None,None,1])
out = mask.sum(axis=(0,3))
Approach #2 : If the previous one runs out of memory, we could use a loop with NumPy boolean arrays, which should be more efficient than masked arrays. Also, we perform one more level of sum-reduction inside the loop, so that we drag less data along across iterations. The alternative implementation would look something like this -
out = np.zeros(refVals.shape[:2]).astype(np.int16)
x_ind = x[ind]
for i in x_ind:
    out += ((refVals >= i-2) & (refVals <= i+2)).sum(-1)
Approach #3 : Alternatively, we could replace that limit-based comparison in approach #2 with np.isclose (with rtol=0 so that only the absolute tolerance of 2 applies). The only step inside the loop would then become -
out += np.isclose(refVals, i, atol=2, rtol=0).sum(-1)
I am new to StackOverflow, and I am extremely new to Python.
My problem is this: I need to write a double sum, as follows:
The motivation is that this is the angular correction to the gravitational potential used for the geoid.
I am having difficulty writing the sums. And please, before you say "Go to such-and-such a resource," or get impatient with me, this is the first time I have ever done coding/programming/whatever this is.
Is this a good place to use a "for" loop?
I have data for the two indices (n,m) and for the coefficients c_{nm} and s_{nm} in a .txt file. Each of those items is a column. When I say usecols, do I number them 0 through 3, or 1 through 4?
(the equation referred to above)
\begin{equation}
V(r, \phi, \lambda) = \sum_{n=2}^{360}\left(\frac{a}{r}\right)^{n}\sum_{m=0}^{n}\left[c_{nm}\cos{(m\lambda)} + s_{nm}\sin{(m\lambda)}\right]\sqrt{\frac{(n-m)!}{(n+m)!}(2n + 1)(2 - \delta_{m0})}\,P_{nm}(\sin{\lambda})
\end{equation}
(2) Yes, a "for" loop is fine. As @jpmc26 notes, a generator expression is a good alternative to a "for" loop. IMO, you'll want to use numpy if efficiency is important to you.
(3) As @askewchan notes, "usecols" refers to an argument of genfromtxt; as specified in that documentation, column indexes start at 0, so you'll want to use 0 to 3.
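For example (the file name here is just a placeholder; the column order n, m, c_nm, s_nm matches the question's description of the .txt file) -
import numpy
data = numpy.genfromtxt("the_coefficients_file.txt", usecols=(0, 1, 2, 3))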
A naive implementation might be okay since the larger factorial is in the denominator, but I wouldn't be surprised if you run into numerical issues. Here's something to get you started. Note that you'll need to define P() and a. I don't understand how "0 through 3" relates to c and s since their indexes range much further. I'm going to assume that each (and delta) has its own file of values.
import math
import numpy
c = numpy.genfromtxt("the_c_file.txt")
s = numpy.genfromtxt("the_s_file.txt")
delta = numpy.genfromtxt("the_delta_file.txt")
def V(r, phi, lam):
    ret = 0
    for n in range(2, 361):
        for m in range(0, n + 1):
            inner = c[n,m]*math.cos(m*lam) + s[n,m]*math.sin(m*lam)
            inner *= math.sqrt(math.factorial(n-m)/math.factorial(n+m)*(2*n+1)*(2-delta[m,0]))
            inner *= P(n, m, math.sin(lam))
            ret += math.pow(a/r, n) * inner
    return ret
Make sure to write unit tests to check the math. Note that "lambda" is a reserved word, hence the parameter name lam.
I've been struggling with an algorithm tied to comparisons of 3D triangle vectors. Unfortunately it's very slow in places and I've gone back and forth on different methods to try and improve it. One thing I'm struggling with is speeding up a distance calculation.
I have two groups of triangles, each broken down into three points, each of which is a 3D float vector (xyz). The calculations I'm using are:
diffverts = numpy.zeros( ( ntris*3, ntesttris*3, 3 ), dtype = 'float32')
diffverts += triverts.reshape(ntris*3, 1, 3 )
diffverts -= ttriverts.reshape(1, ntesttris*3, 3 )
vertdist = ( diffverts[:,:,0]**2 + diffverts[:,:,1]**2 + diffverts[:,:,2]**2 ) ** 0.5
This calculation is faster than:
diffverts = triverts.reshape(ntris*3, 1, 3 ) - ttriverts.reshape(1, ntesttris*3, 3 )
vertdist = ( diffverts[:,:,0]**2 + diffverts[:,:,1]**2 + diffverts[:,:,2]**2 ) ** 0.5
Is there a faster way to populate the diffverts part (which takes the longest) and/or the distance part, which is also quite time consuming? This code is called many times due to the number of groups to test. Also, working only on indexes into the verts causes me other issues in further calculations when I try to get back to some boolean tests (this is only one of a set of calculations, so staying at the triangle-point level works best).
I'm using numpy and Python.
The problem is that brute-force testing of all triangles against each other takes quadratic time. It is better to use a data structure specialized for such computations. Luckily, scipy contains one.
Take a look at scipy.spatial.cKDTree. The help should be self-explanatory.
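A minimal sketch of how that might look here (array names follow the question, and this assumes the triangle corners are simply stacked into (N, 3) point arrays) -
import numpy as np
from scipy.spatial import cKDTree

tree = cKDTree(ttriverts.reshape(-1, 3))           # index the test-triangle points once
dists, idx = tree.query(triverts.reshape(-1, 3))   # nearest test point (and its distance) for each point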
I think diffverts is taking up enough memory to cause cache misses. Unfortunately while this solution is very elegant, you're probably better off computing the whole distance in one go, to avoid having to save an n*m*3 array of intermediate values. As ugly as it is, I would just do nested for loops.
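If the full pairwise distance matrix really is needed, one way to compute the whole distance in one go without materializing the n*m*3 intermediate array (assuming the same reshaped point arrays as in the question) is scipy's cdist -
from scipy.spatial.distance import cdist

vertdist = cdist(triverts.reshape(-1, 3), ttriverts.reshape(-1, 3))   # (ntris*3, ntesttris*3) Euclidean distances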