I have a very large array with only a few small areas of interest. I need to calculate the gradient of this array, but for performance reasons I need this calculation to be restricted to these areas of interest.
I can't do something like this:
phi_grad0[mask] = np.gradient(phi[mask], axis=0)
Because of how fancy indexing works, phi[mask] just becomes a 1D array of the masked pixels, losing spatial information and making the gradient calculation worthless.
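For example, with a small toy array, boolean indexing flattens the selection:
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)
m = img > 5
print(img[m].shape)   # (10,) -- a flat 1-D array; the 2-D layout is gone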
np.gradient does handle np.ma.masked_arrays, but the performance is an order of magnitude worse:
import numpy as np
from timeit_context import timeit_context
phi = np.random.randint(low=-100, high=100, size=[100, 100])
phi_mask = np.random.randint(low=0, high=2, size=phi.shape, dtype=bool)

with timeit_context('full array'):
    for i2 in range(1000):
        phi_masked_grad1 = np.gradient(phi)

with timeit_context('masked_array'):
    phi_masked = np.ma.masked_array(phi, ~phi_mask)
    for i1 in range(1000):
        phi_masked_grad2 = np.gradient(phi_masked)
This produces the output below:
[full array] finished in 143 ms
[masked_array] finished in 1961 ms
I think it's because operations run on masked_arrays are not vectorized, but I'm not sure.
Is there any way of restricting np.gradient so as to achieve better performance?
This timeit_context is a handy timer that works like this, if anyone is interested:
from contextlib import contextmanager
import time

@contextmanager
def timeit_context(name):
    """
    Use it to time a specific code snippet
    Usage: 'with timeit_context('Testcase1'):'
    :param name: Name of the context
    """
    start_time = time.time()
    yield
    elapsed_time = time.time() - start_time
    print('[{}] finished in {} ms'.format(name, int(elapsed_time * 1000)))
Not exactly an answer, but this is what I've managed to patch together for my situation, which works pretty well:
I get 1D indices of the pixels where the condition is true (in this case the condition being < 5 for example):
def get_indices_1d(image, band_thickness):
    return np.where(image.reshape(-1) < 5)[0]
This gives me a 1D array with those indices.
Then I manually calculate the gradient at those positions, in different ways:
def gradient_at_points1(image, indices_1d):
    width = image.shape[1]
    size = image.size
    # Using this instead of ravel() is more likely to produce a view instead of a copy
    raveled_image = image.reshape(-1)
    res_x = 0.5 * (raveled_image[(indices_1d + 1) % size] - raveled_image[(indices_1d - 1) % size])
    res_y = 0.5 * (raveled_image[(indices_1d + width) % size] - raveled_image[(indices_1d - width) % size])
    return [res_y, res_x]
def gradient_at_points2(image, indices_1d):
    indices_2d = np.unravel_index(indices_1d, shape=image.shape)
    # Even without doing the actual deltas this is already slower, and we'll have to check boundary conditions, etc.
    res_x = 0.5 * (image[indices_2d] - image[indices_2d])
    res_y = 0.5 * (image[indices_2d] - image[indices_2d])
    return [res_y, res_x]
def gradient_at_points3(image, indices_1d):
    width = image.shape[1]
    raveled_image = image.reshape(-1)
    res_x = 0.5 * (raveled_image.take(indices_1d + 1, mode='wrap') - raveled_image.take(indices_1d - 1, mode='wrap'))
    res_y = 0.5 * (raveled_image.take(indices_1d + width, mode='wrap') - raveled_image.take(indices_1d - width, mode='wrap'))
    return [res_y, res_x]

def gradient_at_points4(image, indices_1d):
    width = image.shape[1]
    raveled_image = image.ravel()
    res_x = 0.5 * (raveled_image.take(indices_1d + 1, mode='wrap') - raveled_image.take(indices_1d - 1, mode='wrap'))
    res_y = 0.5 * (raveled_image.take(indices_1d + width, mode='wrap') - raveled_image.take(indices_1d - width, mode='wrap'))
    return [res_y, res_x]
My test arrays look like this:
a = np.random.randint(-10, 10, size=[512, 512])
# Force edges to not pass the condition
a[:, 0] = 99
a[:, -1] = 99
a[0, :] = 99
a[-1, :] = 99
indices = get_indices_1d(a, 5)
mask = a < 5
Then I can run these tests:
with timeit_context('full gradient'):
    for i in range(100):
        grad1 = np.gradient(a)

with timeit_context('With masked_array'):
    for im in range(100):
        ma = np.ma.masked_array(a, mask)
        grad6 = np.gradient(ma)

with timeit_context('gradient at points 1'):
    for i1 in range(100):
        grad2 = gradient_at_points1(image=a, indices_1d=indices)

with timeit_context('gradient at points 2'):
    for i2 in range(100):
        grad3 = gradient_at_points2(image=a, indices_1d=indices)

with timeit_context('gradient at points 3'):
    for i3 in range(100):
        grad4 = gradient_at_points3(image=a, indices_1d=indices)

with timeit_context('gradient at points 4'):
    for i4 in range(100):
        grad5 = gradient_at_points4(image=a, indices_1d=indices)
Which give the following results:
[full gradient] finished in 576 ms
[With masked_array] finished in 3455 ms
[gradient at points 1] finished in 421 ms
[gradient at points 2] finished in 451 ms
[gradient at points 3] finished in 112 ms
[gradient at points 4] finished in 102 ms
As you can see, method 4 is by far the best (I don't care much about how much memory it's consuming, however).
This probably only holds because my 2D array is relatively small (512x512). Maybe with much larger arrays this won't be true.
Another caveat is that ndarray.take(indices, mode='wrap') will do some weird stuff around the image edges (one row will 'wrap' into the next, etc.) in order to maintain good performance, so if the edges are ever important for your application you might want to pad the input array with one pixel around the border.
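For instance, a minimal sketch of that padding approach (my own addition, not benchmarked; the indices then refer to the padded image):
# Pad with a value that never passes the '< 5' condition, so the border pixels
# are excluded from the indices and 'wrap' never crosses a real image edge.
a_padded = np.pad(a, pad_width=1, mode='constant', constant_values=99)
indices_padded = get_indices_1d(a_padded, 5)
grad = gradient_at_points4(image=a_padded, indices_1d=indices_padded)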
It's still super interesting how slow masked_arrays are. Pulling the constructor ma = np.ma.masked_array(a, mask) outside the loop doesn't affect the timings, since the masked_array itself just keeps references to the array and its mask.
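You can check that reference behaviour directly (a quick sanity check, not part of the timings above):
ma = np.ma.masked_array(a, mask)
print(np.shares_memory(ma.data, a))   # expected True: the masked array wraps `a` rather than copying it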
In a Monte-Carlo simulation I create many lists of random stick coordinates (actually two coordinate lists per repetition, representing two different stick types) in the form [[x0,y0,x1,y1]*N]. By using vectorized numpy methods I tried to minimize the creation time. However, for certain conditions the length of the arrays goes above 10 million and the generation becomes the bottleneck.
The following code gives a minimal example with some test values
import numpy as np
def create_coordinates_vect(dimensions=[1500, 2500], length=50, count=12000000, type1_content=0.001):
    # two arrays with random start coordinates in area of dimensions
    x0 = np.random.randint(dimensions[0], size=count)
    y0 = np.random.randint(dimensions[1], size=count)
    # random direction of each stick
    dirrad = 2 * np.pi * np.random.rand(count)
    # to distinguish between type1 and type2 sticks based on random values
    stick_type = np.random.rand(count)
    is_type1 = np.zeros_like(stick_type)
    is_type1[stick_type < type1_content] = True
    # calculate end coordinates
    x1 = x0 + np.rint(np.cos(dirrad) * length).astype(np.int32)
    y1 = y0 + np.rint(np.sin(dirrad) * length).astype(np.int32)
    # stack together start and end coordinates
    coordinates = np.vstack((x0, y0, x1, y1)).T.astype(np.int32)
    # split array according to type
    coords_type1 = coordinates[is_type1 == True]
    coords_type2 = coordinates[is_type1 == False]
    return [coords_type1, coords_type2]
list1, list2 = create_coordinates_vect()
The timing analysis gives the following results for the different sections
=> x0, y0: 477.3640632629945 ms
=> dirrad, stick_type: 317.4648284911094 ms
=> is_type1: 27.3699760437172 ms
=> x1, y1: 1184.7038269042969 ms
=> vstack: 189.0783309965234 ms
=> coords_type1, coords_type2: 309.9758625035176 ms
I could still gain some time by defining the number of type1 and type2 sticks beforehand instead of doing a random-number comparison for each stick (see the sketch below). The larger cost of creating the random start coordinates and directions, plus calculating the end coordinates, would remain, however.
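A rough sketch of that idea (not benchmarked), as it would appear inside create_coordinates_vect:
# decide the number of type1 sticks once, then slice;
# the rows of `coordinates` are i.i.d., so the first n_type1 rows are a valid random subset
n_type1 = np.random.binomial(count, type1_content)
coords_type1 = coordinates[:n_type1]
coords_type2 = coordinates[n_type1:]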
Does someone see further optimizations to speed up the creation of the arrays?
As the timings indicate, the x1 & y1 calculations are the slowest portion of the code. There we have cosine and sine computations, scaling by length, and then rounding and converting to int32. Now, one common way to boost NumPy's performance is with the numexpr module.
In our slowest portion, the operations that could be computed with numexpr are sine, cosine and the scaling. Thus, the numexpr-modified version of the code would look like this -
import numexpr as ne
x1 = x0 + np.rint(ne.evaluate("cos(dirrad) * length")).astype(np.int32)
y1 = y0 + np.rint(ne.evaluate("sin(dirrad) * length")).astype(np.int32)
Runtime test -
Let's consider arrays (1/100)th the size of the original ones. Thus, we have -
dimensions=[15,25]
length=50
count=120000
type1_content=0.001
The initial part of the code stays the same -
# two arrays with random start coordinates in area of dimensions
x0 = np.random.randint(dimensions[0], size=count)
y0 = np.random.randint(dimensions[1], size=count)
# random direction of each stick
dirrad = 2 * np.pi * np.random.rand(count)
# to distinguish between type1 and type2 sticks based on random values
stick_type = np.random.rand(count)
is_type1 = np.zeros_like(stick_type)
is_type1[stick_type < type1_content] = True
Next up, we have two branches for runtime testing purposes: one with the original code and another with the proposed numexpr-based approach -
def org_app(x0, y0, dirrad, length):
    x1 = x0 + np.rint(np.cos(dirrad) * length).astype(np.int32)
    y1 = y0 + np.rint(np.sin(dirrad) * length).astype(np.int32)

def new_app(x0, y0, dirrad, length):
    x1 = x0 + np.rint(ne.evaluate("cos(dirrad) * length")).astype(np.int32)
    y1 = y0 + np.rint(ne.evaluate("sin(dirrad) * length")).astype(np.int32)
Finally, the runtime test itself -
In [149]: %timeit org_app(x0,y0,dirrad,length)
10 loops, best of 3: 23.5 ms per loop
In [150]: %timeit new_app(x0,y0,dirrad,length)
100 loops, best of 3: 14.6 ms per loop
So, we are looking at about 40% reduction in runtime there, not bad I guess!
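Part of numexpr's gain also comes from multi-threaded evaluation (besides avoiding temporary arrays). If you want to experiment with that, the thread count can be adjusted, for example -
import numexpr as ne

ne.set_num_threads(ne.detect_number_of_cores())   # use all detected cores
# ne.set_num_threads(1)                           # or force single-threaded for comparison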
I have data for latitude and longitude, and I need to calculate the distance matrix between two arrays containing locations. I used the haversine formula to get the distance between two locations given latitude and longitude.
Here is an example of my code:
import numpy as np
import math
def get_distances(locs_1, locs_2):
    n_rows_1 = locs_1.shape[0]
    n_rows_2 = locs_2.shape[0]
    dists = np.empty((n_rows_1, n_rows_2))
    # The loops here are inefficient
    for i in xrange(n_rows_1):
        for j in xrange(n_rows_2):
            dists[i, j] = get_distance_from_lat_long(locs_1[i], locs_2[j])
    return dists

def get_distance_from_lat_long(loc_1, loc_2):
    earth_radius = 3958.75
    lat_dif = math.radians(loc_1[0] - loc_2[0])
    long_dif = math.radians(loc_1[1] - loc_2[1])
    sin_d_lat = math.sin(lat_dif / 2)
    sin_d_long = math.sin(long_dif / 2)
    step_1 = (sin_d_lat ** 2) + (sin_d_long ** 2) * math.cos(math.radians(loc_1[0])) * math.cos(math.radians(loc_2[0]))
    step_2 = 2 * math.atan2(math.sqrt(step_1), math.sqrt(1 - step_1))
    dist = step_2 * earth_radius
    return dist
My expected output is this:
>>> locations_1 = np.array([[34, -81], [32, -87], [35, -83]])
>>> locations_2 = np.array([[33, -84], [39, -81], [40, -88], [30, -80]])
>>> get_distances(locations_1, locations_2)
array([[ 186.13522573, 345.46610882, 566.23466349, 282.51056676],
[ 187.96657622, 589.43369894, 555.55312473, 436.88855214],
[ 149.5853537 , 297.56950329, 440.81203371, 387.12153747]])
Performance is important for me, and one thing I could do is use Cython to speed up the loops, but it would be nice if I don't have to go there.
Is there a module that can do something like this? Or any other solution?
There are a lot of suboptimal things in the haversine equations you are using. You can trim some of that and minimize the number of sines, cosines and square roots you need to calculate. The following is the best I have been able to come up with; on my system it runs about 5x faster than Ophion's code (which does mostly the same as far as vectorization goes) on two random arrays of 1000 and 2000 elements:
def spherical_dist(pos1, pos2, r=3958.75):
    pos1 = pos1 * np.pi / 180
    pos2 = pos2 * np.pi / 180
    cos_lat1 = np.cos(pos1[..., 0])
    cos_lat2 = np.cos(pos2[..., 0])
    cos_lat_d = np.cos(pos1[..., 0] - pos2[..., 0])
    cos_lon_d = np.cos(pos1[..., 1] - pos2[..., 1])
    return r * np.arccos(cos_lat_d - cos_lat1 * cos_lat2 * (1 - cos_lon_d))
If you feed it your two arrays "as is" it will complain, but that's not a bug, it's a feature. Basically, this function computes the distance on a sphere over the last dimension, and broadcasts on the rest. So you can get what you are after as:
>>> spherical_dist(locations_1[:, None], locations_2)
array([[ 186.13522573, 345.46610882, 566.23466349, 282.51056676],
[ 187.96657622, 589.43369894, 555.55312473, 436.88855214],
[ 149.5853537 , 297.56950329, 440.81203371, 387.12153747]])
But it could also be used to calculate the distances between two lists of points, i.e.:
>>> spherical_dist(locations_1, locations_2[:-1])
array([ 186.13522573, 589.43369894, 440.81203371])
Or between two single points:
>>> spherical_dist(locations_1[0], locations_2[0])
186.1352257300577
This is inspired by how gufuncs work, and once you get used to it I have found it to be a wonderful "swiss army knife" coding style that lets you reuse a single function in lots of different settings.
It is more efficient to use meshgrid to replace the double for loop:
import numpy as np
earth_radius = 3958.75
def get_distances(locs_1, locs_2):
    lats1, lats2 = np.meshgrid(locs_1[:, 0], locs_2[:, 0])
    lons1, lons2 = np.meshgrid(locs_1[:, 1], locs_2[:, 1])
    lat_dif = np.radians(lats1 - lats2)
    long_dif = np.radians(lons1 - lons2)
    sin_d_lat = np.sin(lat_dif / 2.)
    sin_d_long = np.sin(long_dif / 2.)
    step_1 = (sin_d_lat ** 2) + (sin_d_long ** 2) * np.cos(np.radians(lats1)) * np.cos(np.radians(lats2))
    step_2 = 2 * np.arctan2(np.sqrt(step_1), np.sqrt(1 - step_1))
    dist = step_2 * earth_radius
    return dist
This is simply vectorizing your code:
def new_get_distances(loc1, loc2):
    earth_radius = 3958.75

    locs_1 = np.deg2rad(loc1)
    locs_2 = np.deg2rad(loc2)

    lat_dif = (locs_1[:, 0][:, None] / 2 - locs_2[:, 0] / 2)
    lon_dif = (locs_1[:, 1][:, None] / 2 - locs_2[:, 1] / 2)

    np.sin(lat_dif, out=lat_dif)
    np.sin(lon_dif, out=lon_dif)

    np.power(lat_dif, 2, out=lat_dif)
    np.power(lon_dif, 2, out=lon_dif)

    lon_dif *= (np.cos(locs_1[:, 0])[:, None] * np.cos(locs_2[:, 0]))
    lon_dif += lat_dif

    np.arctan2(np.power(lon_dif, .5), np.power(1 - lon_dif, .5), out=lon_dif)
    lon_dif *= (2 * earth_radius)

    return lon_dif
locations_1 = np.array([[34, -81], [32, -87], [35, -83]])
locations_2 = np.array([[33, -84], [39, -81], [40, -88], [30, -80]])
old = get_distances(locations_1, locations_2)
new = new_get_distances(locations_1,locations_2)
np.allclose(old,new)
True
If we look at timings:
%timeit new_get_distances(locations_1,locations_2)
10000 loops, best of 3: 80.6 µs per loop
%timeit get_distances(locations_1,locations_2)
10000 loops, best of 3: 74.9 µs per loop
It is actually slower for a small example; however, let's look at a larger example:
locations_1 = np.random.rand(1000,2)
locations_2 = np.random.rand(1000,2)
%timeit get_distances(locations_1,locations_2)
1 loops, best of 3: 5.84 s per loop
%timeit new_get_distances(locations_1,locations_2)
10 loops, best of 3: 149 ms per loop
We now have a speedup of 40x. Can probably squeeze some more speed in a few places.
Edit: Made a few updates to cut out redundant operations and make it clear that we are not altering the original location arrays.
Does the Haversine formula provide good enough accuracy for your use? It can be off by quite a bit. I think you'd be able to get both accuracy and speed if you use proj.4, in particular the python bindings, pyproj. Note that pyproj can work directly on numpy arrays of coordinates.
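A minimal sketch of the pyproj route (the function name geodesic_dist_matrix is mine; Geod.inv takes lon/lat arrays and returns distances in metres, so convert to miles to match the numbers above):
import numpy as np
from pyproj import Geod

geod = Geod(ellps='WGS84')

def geodesic_dist_matrix(locs_1, locs_2):
    # Build every (i, j) pair explicitly; Geod.inv works element-wise on arrays.
    n1, n2 = locs_1.shape[0], locs_2.shape[0]
    lat1 = np.repeat(locs_1[:, 0], n2)
    lon1 = np.repeat(locs_1[:, 1], n2)
    lat2 = np.tile(locs_2[:, 0], n1)
    lon2 = np.tile(locs_2[:, 1], n1)
    _, _, dist_m = geod.inv(lon1, lat1, lon2, lat2)
    return np.asarray(dist_m).reshape(n1, n2) / 1609.344   # metres -> miles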
A numerical integration is taking exponentially longer than I expect it to. I would like to know if the way that I implement the iteration over the mesh could be a contributing factor. My code looks like this:
import numpy as np
import itertools as it
U = np.linspace(0, 2*np.pi)
V = np.linspace(0, np.pi)
for (u, v) in it.product(U, V):
    # values = computation on each grid point, does not call any outside functions
    # solution = sum(values)
return solution
I left out the computations because they are long and my question is specifically about the way that I have implemented the computation over the parameter space (u, v). I know of alternatives such as numpy.meshgrid; however, these all seem to create instances of (very large) matrices, and I would guess that storing them in memory would slow things down.
Is there an alternative to it.product that would speed up my program, or should I be looking elsewhere for the bottleneck?
Edit: Here is the for loop in question (to see if it can be vectorized).
import random
import numpy as np
import itertools as it
##########################################################################
# Initialize the inputs with random (to save space)
##########################################################################
mat1 = np.array([[random.random() for i in range(3)] for i in range(3)])
mat2 = np.array([[random.random() for i in range(3)] for i in range(3)])
a1, a2, a3 = np.array([random.random() for i in range(3)])
plane_normal = np.array([random.random() for i in range(3)])
plane_point = np.array([random.random() for i in range(3)])
d = np.dot(plane_normal, plane_point)
truthval = True
##########################################################################
# Initialize the loop
##########################################################################
N = 100
U = np.linspace(0, 2*np.pi, N + 1, endpoint = False)
V = np.linspace(0, np.pi, N + 1, endpoint = False)
U = U[1:N+1]
V = V[1:N+1]
Vsum = 0
Usum = 0
##########################################################################
# The for loops starts here
##########################################################################
for (u, v) in it.product(U, V):
    cart_point = np.array([a1*np.cos(u)*np.sin(v),
                           a2*np.sin(u)*np.sin(v),
                           a3*np.cos(v)])

    surf_normal = np.array(
        [2*x / a**2 for (x, a) in zip(cart_point, [a1, a2, a3])])

    differential_area = \
        np.sqrt((a1*a2*np.cos(v)*np.sin(v))**2 +
                a3**2*np.sin(v)**4 *
                ((a2*np.cos(u))**2 + (a1*np.sin(u))**2)) * \
        (np.pi**2 / (2*N**2))

    if (np.dot(plane_normal, cart_point) - d > 0) == truthval:
        perp_normal = plane_normal
        f = np.dot(np.dot(mat2, surf_normal), perp_normal)
        Vsum += f*differential_area
    else:
        perp_normal = -plane_normal
        f = np.dot(np.dot(mat2, surf_normal), perp_normal)
        Usum += f*differential_area

integral = abs(Vsum) + abs(Usum)
If U.shape == (nu,) and V.shape == (nv,), then the following arrays vectorize most of your calculations. With numpy you get the best speed by using arrays for the largest dimensions and looping over the small ones (e.g. 3x3).
Corrected version
A = np.cos(U)[:,None]*np.sin(V)
B = np.sin(U)[:,None]*np.sin(V)
C = np.repeat(np.cos(V)[None,:],U.size,0)
CP = np.dstack([a1*A, a2*B, a3*C])
SN = np.dstack([2*A/a1, 2*B/a2, 2*C/a3])
DA1 = (a1*a2*np.cos(V)*np.sin(V))**2
DA2 = a3*a3*np.sin(V)**4
DA3 = (a2*np.cos(U))**2 + (a1*np.sin(U))**2
DA = DA1 + DA2 * DA3[:,None]
DA = np.sqrt(DA)*(np.pi**2 / (2*Nu*Nv))
D = np.dot(CP, plane_normal)
S = np.sign(D-d)
F1 = np.dot(np.dot(SN, mat2.T), plane_normal)
F = F1 * DA
#F = F * S # apply sign
Vsum = F[S>0].sum()
Usum = F[S<=0].sum()
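For reference, the snippet above assumes a setup along the lines of the question's grid, with Nu and Nv simply being the grid sizes:
N = 100
U = np.linspace(0, 2*np.pi, N + 1, endpoint=False)[1:]
V = np.linspace(0, np.pi, N + 1, endpoint=False)[1:]
Nu, Nv = U.size, V.size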
With the same random values, this produces the same values. On a 100x100 case, it is 10x faster. It's been fun playing with these matrices after a year.
In IPython I did simple sum calculations on your 50 x 50 grid space:
In [31]: sum(u*v for (u,v) in it.product(U,V))
Out[31]: 12337.005501361698
In [33]: UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
Out[33]: 12337.005501361693
In [34]: timeit UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
1000 loops, best of 3: 293 us per loop
In [35]: timeit sum(u*v for (u,v) in it.product(U,V))
100 loops, best of 3: 2.95 ms per loop
In [38]: timeit list(it.product(U,V))
1000 loops, best of 3: 213 us per loop
In [45]: timeit UU,VV = np.meshgrid(U,V); (UU*VV).sum().sum()
10000 loops, best of 3: 70.3 us per loop
# using numpy's own sum is even better
product is slower (by a factor of 10), not because product itself is slow, but because of the point-by-point calculation. If you can vectorize your calculations so they use the two (50,50) arrays (without any sort of looping), it should speed up the overall time. That's the main reason for using numpy.
[k for k in it.product(U,V)] runs in 2 ms for me, and the itertools package is made to be efficient, e.g. it does not create a long array first (http://docs.python.org/2/library/itertools.html).
The culprit seems to be your code inside the iteration, or that you are using a lot of points in linspace.
I have an image processing problem I'm currently solving in python, using numpy and scipy. Briefly, I have an image that I want to apply many local contractions to. My prototype code is working, and the final images look great. However, processing time has become a serious bottleneck in our application. Can you help me speed up my image processing code?
I've tried to boil down our code to the 'cartoon' version below. Profiling suggests that I'm spending most of my time on interpolation. Are there obvious ways to speed up execution?
import cProfile, pstats
import numpy
from scipy.ndimage import interpolation
def get_centered_subimage(center_point, window_size, image):
    x, y = numpy.round(center_point).astype(int)
    xSl = slice(max(x-window_size-1, 0), x+window_size+2)
    ySl = slice(max(y-window_size-1, 0), y+window_size+2)
    subimage = image[xSl, ySl]
    interpolation.shift(
        subimage, shift=(x, y)-center_point, output=subimage)
    return subimage[1:-1, 1:-1]
"""In real life, this is experimental data"""
im = numpy.zeros((1000, 1000), dtype=float)
"""In real life, this mask is a non-zero pattern"""
window_radius = 10
mask = numpy.zeros((2*window_radius+1, 2*window_radius+1), dtype=float)
"""The x, y coordinates in the output image"""
new_grid_x = numpy.linspace(0, im.shape[0]-1, 2*im.shape[0])
new_grid_y = numpy.linspace(0, im.shape[1]-1, 2*im.shape[1])
"""The grid we'll end up interpolating onto"""
grid_step_x = new_grid_x[1] - new_grid_x[0]
grid_step_y = new_grid_y[1] - new_grid_y[0]
subgrid_radius = numpy.floor(
    (-1 + window_radius * 0.5 / grid_step_x,
     -1 + window_radius * 0.5 / grid_step_y))
subgrid = (
    window_radius + 2 * grid_step_x * numpy.arange(
        -subgrid_radius[0], subgrid_radius[0] + 1),
    window_radius + 2 * grid_step_y * numpy.arange(
        -subgrid_radius[1], subgrid_radius[1] + 1))
subgrid_points = ((2*subgrid_radius[0] + 1) *
                  (2*subgrid_radius[1] + 1))
"""The coordinates of the set of spots we we want to contract. In real
life, this set is non-random:"""
numpy.random.seed(0)
num_points = 10000
center_points = numpy.random.random(2*num_points).reshape(num_points, 2)
center_points[:, 0] *= im.shape[0]
center_points[:, 1] *= im.shape[1]
"""The output image"""
final_image = numpy.zeros(
    (new_grid_x.shape[0], new_grid_y.shape[0]), dtype=float)
def profile_me():
    for m, cp in enumerate(center_points):
        """Take an image centered on each illumination point"""
        spot_image = get_centered_subimage(
            center_point=cp, window_size=window_radius, image=im)
        if spot_image.shape != (2*window_radius+1, 2*window_radius+1):
            continue  # Skip to the next spot
        """Mask the image"""
        masked_image = mask * spot_image
        """Resample the image"""
        nearest_grid_index = numpy.round(
            (cp - (new_grid_x[0], new_grid_y[0])) /
            (grid_step_x, grid_step_y))
        nearest_grid_point = (
            (new_grid_x[0], new_grid_y[0]) +
            (grid_step_x, grid_step_y) * nearest_grid_index)
        new_coordinates = numpy.meshgrid(
            subgrid[0] + 2 * (nearest_grid_point[0] - cp[0]),
            subgrid[1] + 2 * (nearest_grid_point[1] - cp[1]))
        resampled_image = interpolation.map_coordinates(
            masked_image,
            (new_coordinates[0].reshape(subgrid_points),
             new_coordinates[1].reshape(subgrid_points))
            ).reshape(2*subgrid_radius[1]+1,
                      2*subgrid_radius[0]+1).T
        """Add the recentered image back to the scan grid"""
        final_image[
            nearest_grid_index[0]-subgrid_radius[0]:
            nearest_grid_index[0]+subgrid_radius[0]+1,
            nearest_grid_index[1]-subgrid_radius[1]:
            nearest_grid_index[1]+subgrid_radius[1]+1,
        ] += resampled_image
cProfile.run('profile_me()', 'profile_results')
p = pstats.Stats('profile_results')
p.strip_dirs().sort_stats('cumulative').print_stats(10)
Vague explanation of what the code does:
We start with a pixellated 2D image, and a set of arbitrary (x, y) points in our image that don't generally fall on an integer grid. For each (x, y) point, I want to multiply the image by a small mask centered precisely on that point. Next we contract/expand the masked region by a finite amount, before finally adding this processed sub-image to a final image, which may not have the same pixel size as the original image. (Not my finest explanation. Ah well).
I'm pretty sure that, as you said, the bulk of the calculation time happens in interpolation.map_coordinates(…), which gets called once for every iteration on center_points, here 10,000 times. Generally, when working with the numpy/scipy stack, you want the repetitive work over a large array to happen inside native NumPy/SciPy functions, i.e. in a C loop over homogeneous data, as opposed to explicitly in Python.
One strategy that might speed up the interpolation, but that will also increase the amount of memory used, is (a partial sketch of the first step follows the list):
First, fetch all the subimages (here named masked_image) into a 3-dimensional array (window_radius x window_radius x center_points.size)
Make a ufunc (read that, it's useful) that wraps the work that has to be done on each subimage, using numpy.frompyfunc, which should return another 3-dimensional array (subgrid_radius[0] x subgrid_radius[1] x center_points.size). In short, this creates a vectorized version of the python function, that can be broadcast element-wise on an array.
Build the final image by summing over the third dimension.
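A partial sketch of the first step only (my own illustration, not the answer's code; it assumes the windows are taken at integer-rounded centres and simply drops spots whose window would leave the image):
import numpy as np

def stack_subimages(image, centers, radius):
    # Gather fixed-size windows around integer-rounded centres into a single
    # (n_kept, 2*radius+1, 2*radius+1) stack (stack axis first here);
    # windows that would fall outside the image are dropped.
    rounded = np.round(centers).astype(int)
    keep = ((rounded[:, 0] >= radius) & (rounded[:, 0] < image.shape[0] - radius) &
            (rounded[:, 1] >= radius) & (rounded[:, 1] < image.shape[1] - radius))
    rounded = rounded[keep]
    offsets = np.arange(-radius, radius + 1)
    rows = rounded[:, 0][:, None, None] + offsets[None, :, None]
    cols = rounded[:, 1][:, None, None] + offsets[None, None, :]
    return image[rows, cols], keep
The per-spot work (masking, shifting, resampling) would then operate on that whole stack, and the final image is built by accumulating along the stack axis.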
Hope that gets you closer to your goals!