Efficiently find closest points to track in space & time on gridded data - python

Summary/simplified version
Given a list of track points defined by three 1-dimensional arrays (lats, lons and dtime, all with the same length) and a gridded 3-dimensional array rr (defined by 2-D lat_radar and lon_radar coordinate arrays and a 1-dimensional time_radar array), I want to extract all the grid values in rr whose coordinates (latitude, longitude AND time) are closest to the three 1-dimensional arrays.
I've managed to use cKDTree to select points in space but I don't know how to generalize the solution to space & time together. Right now I have to do the selection on time separately and it makes the code quite bulky and hard to read.
For more details about this problem, see the extended version below.
Extended version
I'm trying to develop an app that uses precipitation data obtained from weather radar composites to predict the precipitation along a track. Most apps predict the precipitation at a fixed point, without considering that the point moves in time.
The idea is, given points identifying a track in space and time, find the closest grid points from radar data to obtain a precipitation estimate over the track (see plot). The final goal would be to shift the start time to identify the best time to leave to avoid rain.
I just optimized my previous algorithm, which used plain loops, to use cKDTree from scipy. Execution time went down from 30 s to 380 ms :). However, I think the code can still be optimized. Here is my attempt.
As input we have
lons, lats: coordinates of the track as 1-D arrays of length N
dtime: 1-D timedelta array of length T containing the time elapsed on the track
lon_radar, lat_radar: M x P matrices containing the coordinates of the radar data
dtime_radar: 1-D timedelta array of length Q containing the radar forecast steps
rr: M x P x Q array containing the radar forecast at every time step
First find the grid points closest to the trajectory using cKDTree:
combined_x_y_arrays = np.dstack([lon_radar.ravel(),
                                 lat_radar.ravel()])[0]
points_list = list(np.vstack([lons, lats]).T)
def do_kdtree(combined_x_y_arrays, points):
    mytree = cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes
results = do_kdtree(combined_x_y_arrays, points_list)
# As we have many duplicates, since the itinerary has a much higher resolution than the radar,
# we only select the unique points
inds_itinerary = np.unique(results)
lon_lat_itinerary = combined_x_y_arrays[inds_itinerary]
Then find the closest points in the track to subset it. It doesn't make sense to have a track resolution of 10 m if the radar only has grid points every km.
combined_x_y_arrays = np.vstack([lons, lats]).T
points_list = list(lon_lat_itinerary)
results = do_kdtree(combined_x_y_arrays, points_list)
Now we can use these positions to get the elapsed time on the trajectory and the corresponding time steps in the radar data:
dtime_itinerary = dtime[results]
# find indices of these dtimes in radar dtime
inds_dtime_radar = np.abs(np.subtract.outer(dtime_radar, dtime_itinerary)).argmin(0)
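If dtime_radar is sorted (as a forecast time axis normally is), the outer difference can also be avoided with a searchsorted-based nearest lookup; a minimal sketch using the arrays defined above, not from the original post:
# Sketch: nearest-time lookup via np.searchsorted (assumes dtime_radar is sorted ascending).
pos = np.searchsorted(dtime_radar, dtime_itinerary)
pos = np.clip(pos, 1, len(dtime_radar) - 1)
# choose whichever neighbour (left or right) is closer in time
left_closer = (dtime_itinerary - dtime_radar[pos - 1]) < (dtime_radar[pos] - dtime_itinerary)
inds_dtime_radar = pos - left_closer.astype(int)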
Now we have everything that we need to find the precipitation, so we only need one last loop. I also loop on shifts to obtain predictions with different start times.
shifts = (1, 3, 5, 7, 9)
rain = np.empty(shape=(len(shifts), len(inds_itinerary)))
for i, shift in enumerate(shifts):
    temp = []
    for i_time, i_space in zip(inds_dtime_radar, inds_itinerary):
        temp.append(rr[i_time + shift].ravel()[i_space])
    rain[i, :] = temp
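This last loop can probably be replaced with a single fancy-indexing step (a sketch, assuming the time axis of rr comes first, as the indexing rr[i_time + shift] above implies):
shifts_arr = np.asarray(shifts)                              # reuse the shifts tuple from above
rr_flat = rr.reshape(rr.shape[0], -1)                        # flatten the spatial grid, time axis first
time_idx = inds_dtime_radar[None, :] + shifts_arr[:, None]   # (n_shifts, n_track_points) time indices
rain = rr_flat[time_idx, inds_itinerary[None, :]]            # same shape as the rain array above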
In particular I would like to find a way to combine the time search with the lat-lon search for the closest points.
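One possible direction (a sketch under some assumptions, not a tested solution) is to treat scaled time as a third coordinate and build a single 3-D cKDTree over all radar grid points at all forecast steps. The time_scale factor below is a hypothetical weight that decides how much a time mismatch costs relative to a degree of spatial mismatch; the sketch also assumes dtime and dtime_radar are numpy timedelta64 arrays and that dtime has one entry per track point, as in the summary above:
import numpy as np
from scipy.spatial import cKDTree
# Hypothetical weight converting seconds of time mismatch into "degrees" of distance; needs tuning.
time_scale = 0.01
t_radar = dtime_radar / np.timedelta64(1, 's')   # forecast steps as float seconds
t_track = dtime / np.timedelta64(1, 's')         # elapsed track time as float seconds
# Space-time cloud of radar grid points: every (lon, lat) repeated for every
# forecast step, with the scaled time as a third coordinate.
lon_flat = np.tile(lon_radar.ravel(), len(t_radar))
lat_flat = np.tile(lat_radar.ravel(), len(t_radar))
t_flat = np.repeat(t_radar, lon_radar.size)
tree = cKDTree(np.column_stack([lon_flat, lat_flat, t_flat * time_scale]))
# Query with the track points expressed in the same space-time coordinates.
dist, idx = tree.query(np.column_stack([lons, lats, t_track * time_scale]))
# idx indexes the flattened (Q, M*P) layout built above, so unravel it.
i_time, i_space = np.unravel_index(idx, (len(t_radar), lon_radar.size))
rain_track = rr.reshape(len(t_radar), -1)[i_time, i_space]   # assumes the time axis of rr comes first
This trades memory for simplicity, since the tree holds Q x M x P points; choosing time_scale is the part that needs tuning, because it decides how a time offset is weighed against a spatial offset.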

Related

'Lining up' large lat/lon grid with smaller lat/lon grid

Let's say I have a large array of values that represent terrain latitude locations and has shape x. I also have another array of values that represent terrain longitude locations and has shape y. All of the values in x as well as y are equally spaced at 0.005 degrees. In other words:
lons[0:10] = [-130.0, -129.995, -129.99, -129.985, -129.98, -129.975, -129.97, -129.965, -129.96, -129.955]
lats[0:10] = [55.0, 54.995, 54.99, 54.985, 54.98, 54.975, 54.97, 54.965, 54.96, 54.955]
I have a second dataset that is projected in an irregularly-spaced lat/lon grid (but equally spaced at ~25 meters apart) that is [m,n] dimensions big, and falls within the domain of x and y. Furthermore, we also have all of the lat/lon points within this second dataset. I would like to 'line up' the grids such that every value of [m,n] matches the nearest-neighbor terrain value within the larger grid. I am able to do this with the following code, where I basically loop through every lat/lon value in dataset two and find the argmin of the calculated lat/lon differences from dataset one:
for a in range(0,lats.shape[0]):
    # Loop through the ranges
    for r in range(0,lons.shape[0]):
        # Access the elements
        tmp_lon = lons[r]
        tmp_lat = lats[a]
        # Now we need to find where the tmp_lon and tmp_lat match best with the index from new_lats and new_lons
        idx = (np.abs(new_lats - tmp_lat)).argmin()
        idy = (np.abs(new_lons - tmp_lon)).argmin()
        # Make our final array!
        second_dataset_trn[a,r] = first_dataset_trn[idy,idx]
Except it is exceptionally slow. Is there another method, either through a package, library, etc. that can speed this up?
Please take a look at the following previous question for iterating over two lists, which may improve the speed: Is there a better way to iterate over two lists, getting one element from each list for each iteration?
A possible correction to the sample code: assuming that the arrays are organized in the standard GIS fashion of Latitude, Longitude, I believe there is an error in the idx and idy variable assignments - the variables receiving the assignments should be swapped (idx should be idy, and the other way around). For example:
# Now we need to find where the tmp_lon and tmp_lat match best with the index from new_lats and new_lons
idy = (np.abs(new_lats - tmp_lat)).argmin()
idx = (np.abs(new_lons - tmp_lon)).argmin()
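For the speed question itself, each argmin only depends on one coordinate, so both searches can be pulled out of the loops entirely; a rough sketch using the corrected index order, assuming new_lats and new_lons are 1-D as the argmin calls above imply:
import numpy as np
# One index into new_lats per value of lats, and one index into new_lons per value of lons.
idy_all = np.abs(new_lats[:, None] - lats[None, :]).argmin(axis=0)
idx_all = np.abs(new_lons[:, None] - lons[None, :]).argmin(axis=0)
# Broadcast the per-axis indices into the full (lat, lon) output grid.
second_dataset_trn = first_dataset_trn[idy_all[:, None], idx_all[None, :]]
For very large axes, np.searchsorted on the sorted coordinate arrays would avoid building the full outer-difference matrices.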

Simulate speakers around a sphere using superposition - speed improvements needed

Note: Drastic speed improvements since posting, see edits at bottom.
I have some working code, but it over-utilizes loops and I'm pretty sure there should be a faster way of doing it. The size of the output array ends up being pretty large, so when I try to make other arrays the same size as the output, I run out of memory rather quickly.
I am simulating many speakers placed around a sphere all pointing toward the center. I have a simulation of a single speaker and I would like to leverage this single simulation by using the principle of superposition. Basically I want to sum up rotated copies of the single transducer simulation to get my final result.
I have an axisymmetric simulation of acoustic pressure data in cylindrical coordinates ("polar_coord_r", "polar_coord_z"). The pressure field from the simulation is unique at each R and Z value and completely described by a 2D array ("P_real_RZ").
I want to sum together multiple, rotated copies of this pressure field onto a 3D Cartesian output array. Each copy is rotated to a different location on the sphere. Currently, I am specifying the rotation with an x,y,z point because it allows me to do vector math (spherical coordinates wouldn't allow me to do this as elegantly). The output array is rather large (770 × 770 × 804).
I have some working code to get the output from a single copy of the speaker ("transducer"). It takes about 12 seconds for each slice, so it would take over two hours to add each new speaker!! I want to have a dozen or so copies of the speaker, so this will take way too long.
The code takes a slice with constant X and computes the R and Z positions at each location in that slice. "r_distance" is a 2D array containing the radial distance from a line passing between the origin and a point ("point"). Similarly, "z_distance" is a 2D array containing the distance along that same line.
To get the pressure for the slice, I find the indices of the closest matching "polar_coord_r" and "polar_coord_z" to the computed R distances and Z distances. I use these indices to find what value of pressure (from P_real_RZ) to place at each value in the output.
Some definitions:
xx, yy, and zz are 1D arrays describing the slices through the output volume
XXX, YYY, and ZZZ are 3D arrays produced by numpy.meshgrid
point is a point which defines the direction that the speaker is rotated. Basically it's just a position vector of the speaker's center.
P_real_RZ is a 2D array which specifies the real pressure at each unique R and Z value.
polar_coord_r and polar_coord_z are 1D arrays which define the unique values of R and Z on which P_real_RZ is defined.
current_transducer (only one so far represented in this code) contains the pressure values computed for the current point.
output is the result from summing many speakers/transducers together.
Any suggestions to speed up this code are greatly appreciated.
Working loop:
for i, x in enumerate(xx):
    # Creates a unit vector from origin to a point
    vector = normalize(point)
    # Create a slice of the cartesian space with constant X
    xyz_slice = np.array([x*np.ones_like(XXX[i,:,:]), YYY[i,:,:], ZZZ[i,:,:]])
    # Projects the position vector of each point of the slice onto the unit vector.
    projection = np.array(list(map(np.dot, xyz_slice, vector)))
    # Normalizes the projection, which results in the Z distance along the line passing through the point
    #z_distance = np.apply_along_axis(np.linalg.norm, 0, projection) # this is the slow bit
    z_distance = np.linalg.norm(projection, axis=0) # I'm an idiot
    # Uses vector math to determine the distance from the line
    # Each point in the XYZ slice is the sum of the vector along the line and the vector away from the line (radial vector).
    # By extension, the position of the xyz point minus the projection of the point onto the unit vector gives the radial vector
    # Norm the radial vector to get the R value for everywhere in the slice
    #r_distance = np.apply_along_axis(np.linalg.norm, 0, xyz_slice - projection) # this is the slow bit
    r_distance = np.linalg.norm(xyz_slice - projection, axis=0) # I'm an idiot
    # Map the pressure data to each point in the slice using the R and Z distance with the RZ pressure slice.
    # look for a more efficient way to do this perhaps. currently takes about 12 seconds per slice
    r_indices = r_map_v(r_distance) # 1.3 seconds by itself
    z_indices = z_map_v(z_distance)
    r_indices[r_indices > 384] = 384  # clamp indices above the maximum valid r index
    z_indices[z_indices > 803] = 803  # clamp indices above the maximum valid z index
    current_transducer[i,:,:] = P_real_RZ[z_indices, r_indices]
# Sum the mapped pressure data into the output.
output += current_transducer
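Since r_map_v and z_map_v are not shown, here is one guess at a cheaper replacement for that mapping step: a vectorised nearest-index lookup into the 1-D coordinate arrays with np.searchsorted (a sketch, assuming polar_coord_r and polar_coord_z are sorted ascending):
import numpy as np
def nearest_index(coords, values):
    """Index of the closest entry of sorted 1-D coords for every element of values."""
    pos = np.searchsorted(coords, values)
    pos = np.clip(pos, 1, len(coords) - 1)
    left_is_closer = (values - coords[pos - 1]) < (coords[pos] - values)
    return pos - left_is_closer.astype(int)
# Hypothetical drop-in use inside the loop above:
# r_indices = nearest_index(polar_coord_r, r_distance)
# z_indices = nearest_index(polar_coord_z, z_distance)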
I have also tried to work with the simulation data in the form of a 3D Cartesian array, that is, the pressure data from the simulation for all x, y, and z values, the same size as the output. I can rotate this 3D array in one direction (not the two rotations needed for speakers arranged on a sphere). This takes up way too much memory and is still painfully slow. I end up getting memory errors with this approach.
Edit: I found a slightly simpler way to do it but it is still slow. I've updated the code above so that there are no longer nested loops.
I ran a line profiler and the slowest lines by far were the two containing np.apply_along_axis(). I'm afraid I might have to rethink how I do this completely.
Final Edit: I initially had a nested loop which I assumed to be the issue. I don't know what made me think I needed to use apply_along_axis with linalg.norm. In any case that was the issue.
I haven't looked for all the ways that you could optimize this code, but this issue jumped out at me: "I ran a line profiler and the slowest lines by far were the two containing np.apply_along_axis()." np.linalg.norm accepts an axis argument. You can replace the line
z_distance = np.apply_along_axis(np.linalg.norm, 0, projection)
with
z_distance = np.linalg.norm(projection, axis=0)
(and likewise for the other use of np.apply_along_axis and np.linalg.norm).
That should improve the performance a bit.
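A quick sanity check on random data that the two forms agree:
import numpy as np
a = np.random.rand(3, 4, 5)
slow = np.apply_along_axis(np.linalg.norm, 0, a)   # norm of each length-3 slice along axis 0
fast = np.linalg.norm(a, axis=0)
assert np.allclose(slow, fast)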

Improving a method for a spatially aware median filter for point clouds in Python

I have point cloud data from airborne LiDAR. It is noisy, so I want to run a median filter which collects points within N metres of each point, finds the median elevation value, and returns the neighbourhood median as an adjusted elevation value.
This is roughly analogous to gridding the data, and taking the median of elevations within each bin. Or scipy.signal.medfilt.
But - I want to preserve the location (x,y) of each point. Also I'm not sure that medfilt preserves the spatial information required.
I have a method, but it involves multiple for loops. Expensive when millions of points go in
Updated - for each iteration of the first loop, a small patch of points is selected for the shapely intersection operation. The first version searched all input points for an intersection at every iteration. Now, only a small patch at a time is converted to a shapely geometry and used for the intersection:
import numpy as np
from shapely import geometry

def spatialmedian(pointcloud, radius):
    """
    Using shapely geometries, replace every point in a cloud with the
    median value of points within 'radius' units of the point.
    'pointcloud' must have no more than 3 dimensions (x,y,z)
    """
    new_z = []
    i = 0
    for point in pointcloud:
        # pick a point and make it a shapely Point
        point = geometry.Point(pointcloud[i, :])
        # select a patch around the point and make it a shapely
        # MultiPoint
        patch = geometry.MultiPoint(list(pointcloud[
            (pointcloud[:, 0] > point.x - (radius + 0.5)) &
            (pointcloud[:, 0] < point.x + (radius + 0.5)) &
            (pointcloud[:, 1] > point.y - (radius + 0.5)) &
            (pointcloud[:, 1] < point.y + (radius + 0.5))
            ]))
        # buffer the Point by radius
        pbuff = point.buffer(radius)
        # use the intersection method to find points in our
        # patch that lie inside the Point buffer
        isect = pbuff.intersection(patch)
        #print(isect.geom_type)
        # initialise another list
        plist = []
        # for every intersection set,
        # unpack it into a list and collect the median
        # Z value.
        if isect.geom_type == 'MultiPoint':
            #print('point has neighbours')
            for p in isect:
                plist.append(p.z)
            new_z.append(np.median(plist))
        else:
            # if the intersection set isn't MultiPoint,
            # it is an isolated point, whose median Z value
            # is its own.
            #print('isolated point')
            # append it to the big list
            new_z.append(isect.z)
        # increment i
        i += 1
        #print(i)
    # return a list of new median-filtered Z coordinates
    return new_z
This works by:
ingesting a list/array of XYZ points
the first for loop goes through the list and for every point:
picks out a patch of the point cloud just bigger than the neighbourhood specified
uses shapely to place a 3 metre buffer around the point
finds the intersection of the buffer and the whole point cloud
extracts the set of points from that operation in another for loop
finds the median and appends it to a list of new Z values
returning the list of new Z values
For 10^4 points, I get a result in 11 seconds; for 10^5 points, 3 minutes. Most of my datasets run to 2-5 * 10^6 points, and on a 2 * 10^6 point cloud it's been running overnight.
What I want is a faster/more efficient method!
I've been tinkering with python-pcl, which is fast for filtering point clouds, but I don't know how to return indices of points which pass/fail pcl-python filters. I need those indices because each point has other attributes which must remain attached to it.
If anyone can suggest a more efficient method, please do so - I would highly appreciate your help. If it can't go faster and this code is helpful, feel free to use it.
Thanks!
After some good advice, I tried this:
#import numpy and scikit-learn's neighbours module
import numpy as np
from sklearn import neighbors as nb
#make a little time ticker
from datetime import datetime
startTime = datetime.now()
# generate a KDTree object. This takes ~95% of the
# processing time
tree = nb.KDTree(xyzi[:,0:3], leaf_size=60)
# how long did tree generation take
print(datetime.now() - startTime)
#initialise a list
new_z = []
#for each point, collect neighbours within radius r
nhoods = tree.query_radius(xyzi[:,0:3], r=3)
# iterate through the list of neighbourhoods,
# find the median height, and add it to the output list
for point in nhoods:
    new_z.append(np.median(xyzi[point,2]))
# how long did it take?
print(datetime.now() - startTime)
This version took ~33 minutes for just under two million points. Acceptable, but still could be better.
Can the KDTree generation go faster using a @jit method?
Is there a better method than looping through all the neighbourhoods to find the neighbourhood medians? Here, nhoods is an array of arrays, so I thought something like:
median = np.median(nhoods[:][:,2])
...but it didn't work.
Thanks!
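One direction worth trying (untested on this data, so treat it as a sketch): scipy's cKDTree is often faster to build and query than sklearn's KDTree for point clouds of this size, and the ragged per-neighbourhood medians can at least be collected with a comprehension instead of repeated appends:
import numpy as np
from scipy.spatial import cKDTree
tree = cKDTree(xyzi[:, 0:3])                       # same 3-D coordinates as the sklearn version
nhoods = tree.query_ball_point(xyzi[:, 0:3], r=3)  # list of index lists, one per point
# The neighbourhoods are ragged, so the median step cannot be fully vectorised,
# but a comprehension avoids the explicit append loop.
new_z = np.array([np.median(xyzi[idx, 2]) for idx in nhoods])
If your SciPy is recent enough, the query methods also take a workers argument to parallelise the queries, which may help more here than @jit.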

Vectorizing for loops in python with numpy multidimensional arrays

I'm trying to improve the performance of the code below. Eventually it will be using much bigger arrays, but I thought I would start off with something simple that works, then look at where it is slow, optimise it, and then try it out at full size. Here is the original code:
# Minimum example with random variables
import numpy as np
import matplotlib.pyplot as plt
n = 4
# Theoretical travel time to each station
ttable = np.array([1, 2, 3, 4])
# Seismic traces, measured at each station
traces = np.random.random((n, 506))
dt = 0.1
# Forward problem: add energy to each trace at the desired time from a given origin time
given_origin_time = 1
for i in range(n):
    # Energy will arrive at the sample equivalent to origin time + travel time
    arrival_sample = int(round((given_origin_time + ttable[i]) / dt))
    traces[i, arrival_sample] = 2
# The aim is to find the origin time by trying each possible origin time and adding the energy up.
# Where this "stack" is highest is likely to be the origin time.
# Find the maximum travel time
tmax = ttable.max()
# We pad the traces so that when we shift by a travel time the trace still has a value
traces = np.lib.pad(traces, ((0, 0), (round(tmax/dt), round(tmax/dt))), 'constant', constant_values=0)
# Available origin times to search for relative to the beginning of the trace
origin_times = np.linspace(-tmax, len(traces), len(traces) + round(tmax/dt))
# Create an empty array to fill with our stack
S = np.empty((origin_times.shape[0]))
# Loop over all the potential origin times
for l, otime in enumerate(origin_times):
    # Create some variables which we will sum up over all stations
    sum_point = 0
    sqrr_sum_point = 0
    # Loop over each station
    for m in range(n):
        # Find the appropriate travel time
        ttime = ttable[m]
        # Grab the point on the trace that corresponds to this travel time + the origin time we are searching for
        point = traces[m, int(round((tmax + otime + ttime) / dt))]
        # Sum up the points
        sum_point += point
        # Sum of the squares of the points
        sqrr_sum_point += point**2
    # Create the stack by taking the square of the sums divided by the sum of the squares normalised by the number of stations
    S[l] = sum_point  # **2/(n*sqrr_sum_point)
# Plot the output; the peak should be at given_origin_time
plt.plot(origin_times, S)
plt.show()
I think the problem is that I don't understand the broadcasting and indexing of multidimensional arrays. After this I will extend the dimensions to search for x, y, z, which would be given by increasing the dimensions of ttable. I will probably try to implement either PyTables or np.memmap to help with the large arrays.
With some quick profiling, it appears that the line
point=traces[m,int(round((tmax+otime+ttime)/dt))]
is taking ~40% of the total program's runtime. Let's see if we can vectorize it a bit:
ttime_inds = np.around((tmax + otime + ttable) / dt).astype(int)
# Loop over each station
for m in range(n):
    # Grab the point on the trace that corresponds to this travel time + the origin time we are searching for
    point = traces[m, ttime_inds[m]]
We noticed that the only thing changing in the loop (other than m) was ttime, so we pulled it out and vectorized that part using numpy functions.
That was the biggest hotspot, but we can go a bit further and remove the inner loop entirely:
# Loop over all the potential origin times
for l,otime in enumerate(origin_times):
ttime_inds = np.around((tmax + otime + ttable) / dt).astype(int)
points = traces[np.arange(n),ttime_inds]
sum_point = points.sum()
sqrr_sum_point = (points**2).sum()
# Create the stack by taking the square of the sums dived by sum of the squares normalised by the number of stations
S[l]=sum_point#**2/(n*sqrr_sum_point)
EDIT: If you want to remove the outer loop as well, we need to pull otime out:
ttime_inds = np.around((tmax + origin_times[:,None] + ttable) / dt).astype(int)
Then, we proceed as before, summing over the second axis:
points = traces[np.arange(n),ttime_inds]
sum_points = points.sum(axis=1)
sqrr_sum_points = (points**2).sum(axis=1)
S = sum_points # **2/(n*sqrr_sum_points)

Finding n nearest data points to grid locations

I'm working on a problem where I have a large set (>4 million) of data points located in a three-dimensional space, each with a scalar function value. This is represented by four arrays: XD, YD, ZD, and FD. The tuple (XD[i], YD[i], ZD[i]) refers to the location of data point i, which has a value of FD[i].
I'd like to superimpose a rectilinear grid of, say, 100x100x100 points in the same space as my data. This grid is set up as follows.
[XGrid, YGrid, ZGrid] = np.mgrid[Xmin:Xmax:Xstep, Ymin:Ymax:Ystep, Zmin:Zmax:Zstep]
XG = XGrid[:,0,0]
YG = YGrid[0,:,0]
ZG = ZGrid[0,0,:]
XGrid is a 3D array of the x-value at each point in the grid. XG is a 1D array of the x-values going from Xmin to Xmax, separated by a distance of XStep.
I'd like to use an interpolation algorithm I have to find the value of the function at each grid point based on the data surrounding it. In this algorithm I require 20 data points closest (or at least close) to my grid point of interest. That is, for grid point (XG[i], YG[j], ZG[k]) I want to find the 20 closest data points.
The only way I can think of is to have a set of for loops that goes through each grid point, with an embedded loop going through all (so many!) data points, calculating the Euclidean distance, and picking out the 20 closest ones.
for i in range(0, XG.shape[0]):
    for j in range(0, YG.shape[0]):
        for k in range(0, ZG.shape[0]):
            Distance = np.zeros(XD.shape)
            for a in range(0, XD.shape[0]):
                Distance[a] = (XD[a] - XG[i])**2 + (YD[a] - YG[j])**2 + (ZD[a] - ZG[k])**2
            B = np.zeros([20], int)
            for a in range(0, 20):
                indx = np.argmin(Distance)
                B[a] = indx
                Distance[indx] = float('inf')
This would give me an array, B, of the indices of the data points closest to the grid point. I feel like this would take too long to go through each data point at each grid point.
I'm looking for any suggestions, such as how I might be able to organize the data points before calculating distances, which could cut down on computation time.
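For what it's worth, a k-d tree (as used in the first question on this page) handles exactly this kind of bulk k-nearest lookup; a minimal sketch using the variable names from the post:
import numpy as np
from scipy.spatial import cKDTree
# Build the tree over the data points once, then query all grid points at once.
data = np.column_stack([XD, YD, ZD])
tree = cKDTree(data)
grid = np.stack(np.meshgrid(XG, YG, ZG, indexing='ij'), axis=-1).reshape(-1, 3)
dist, B = tree.query(grid, k=20)   # B[g] holds the indices of the 20 closest data points to grid point g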
Have a look at a seemingly similar but 2D problem and see if you cannot improve with ideas from there.
From the top of my head, I'm thinking that you can sort the points according to their coordinates (three separate arrays). When you need the closest points to the [X, Y, Z] grid point you'll quickly locate points in those three arrays and start from there.
Also, you don't really need the Euclidean distance, since you are only interested in relative distance, which can also be approximated as:
abs(deltaX) + abs(deltaY) + abs(deltaZ)
and save on the expensive powers and square roots...
No need to iterate over your data points for each grid location: Your grid locations are inherently ordered, so just iterate over your data points once, and assign each data point to the eight grid locations that surround it. When you're done, some grid locations may have too few data points. Check the data points of adjacent grid locations. If you have plenty of data points to go around (it depends on how your data is distributed), you can already select the 20 closest neighbors during the initial pass.
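To illustrate the binning idea, a sketch assuming the regular grid defined by Xmin, Xstep, etc. above:
import numpy as np
# Cell index of each data point along each axis; the eight surrounding grid
# locations of point i are (ix[i]+dx, iy[i]+dy, iz[i]+dz) for dx, dy, dz in {0, 1}.
ix = np.floor((XD - Xmin) / Xstep).astype(int)
iy = np.floor((YD - Ymin) / Ystep).astype(int)
iz = np.floor((ZD - Zmin) / Zstep).astype(int)
# (Clipping to the grid bounds is omitted here for brevity.)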
Addendum: You may want to reconsider other parts of your algorithm as well. Your algorithm is a kind of piecewise-linear interpolation, and there are plenty of relatively simple improvements. Instead of dividing your space into evenly spaced cubes, consider allocating a number of center points and dynamically repositioning them until the average distance of data points from the nearest center point is minimized, like this:
1. Allocate each data point to its closest center point.
2. Reposition each center point to the coordinates that would minimize the average distance from "its" points (i.e. to the "centroid" of that data subset).
3. Some data points now have a different closest center point. Repeat steps 1 and 2 until you converge (or get near enough).
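A rough sketch of that repositioning loop (essentially a k-means / Lloyd iteration; the number of center points and iterations below are arbitrary placeholders, and the convergence test is left out):
import numpy as np
from scipy.spatial import cKDTree

def reposition_centers(points, k=1000, n_iter=20, seed=0):
    # points: (N, 3) array of data point coordinates
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # 1. allocate each data point to its closest center
        _, owner = cKDTree(centers).query(points)
        # 2. move each center to the centroid of "its" points
        for c in range(k):
            mine = points[owner == c]
            if len(mine):
                centers[c] = mine.mean(axis=0)
    return centers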
