How to speed up this for loop in Python

I'm dealing with two sets of three large lists of the same size containing longitude, latitude and altitude coordinates in UTM format (see the lists below). The sets contain overlapping coordinates (i.e. the longitude and latitude values are equal). If the values in Lon are equal to Lon2 and the values in Lat are equal to Lat2, then I want to calculate the mean altitude at those indexes. However, if they're not equal, the longitude, latitude and altitude values remain unchanged. I only want to collapse the overlapping data into one set of longitude and latitude coordinates and calculate the mean altitude at those coordinates.
This is my attempt so far:
import numpy as np

Lon = [450000.50, 459000.50, 460000, 470000]
Lat = [5800000.50, 459000.50, 500000, 470000]
Alt = [-1, -9, -2, 1]
Lon2 = [450000.50, 459000.50, 460000, 470000]
Lat2 = [5800000.50, 459000.50, 800000, 470000]
Alt2 = [-3, -1, -20, 2]

MeanAlt = []
appendAlt = MeanAlt.append
LonOverlap = []
appendLon = LonOverlap.append
LatOverlap = []
appendLat = LatOverlap.append

for i, a in enumerate(Lon and Lat and Alt):
    for j, b in enumerate(Lon2 and Lat2 and Alt2):
        if Lon[i] == Lon2[j] and Lat[i] == Lat2[j]:
            MeanAltData = (Alt[i] + Alt2[j]) / 2
            appendAlt(MeanAltData)
            LonOverlapData = Lon[i]
            appendLon(LonOverlapData)
            LatOverlapData = Lat[i]
            appendLat(LatOverlapData)

print(MeanAlt)  # correct ans should be MeanAlt = [-2.0, -5, 1.5]
print(LonOverlap)
print(LatOverlap)
I'm working in a Jupyter notebook and my laptop is rather slow, so I need to make this code much more efficient. I would appreciate any help on this. Thank you :)

I believe your code can be improved in two ways:
Firstly, use tuples instead of lists: iterating over a tuple is generally a little faster than iterating over a list.
Secondly, your two nested for loops can be reduced to a single loop over the indices, which works provided the overlapping points sit at the same index in both sets and all your tuples contain the same number of items (i.e. len(Lat) == len(Lon) == len(Alt) == len(Lat2) == len(Lon2) == len(Alt2)).
Here is the improved code (I took the liberty of removing the import numpy statement, as it was not being used in the piece of code you provided):
# use of tuples
Lon = (450000.50, 459000.50, 460000, 470000)
Lat = (5800000.50, 459000.50, 500000, 470000)
Alt = (-1, -9, -2, 1)
Lon2 = (450000.50, 459000.50, 460000, 470000)
Lat2 = (5800000.50, 459000.50, 800000, 470000)
Alt2 = (-3, -1, -20, 2)

MeanAlt = []
appendAlt = MeanAlt.append
LonOverlap = []
appendLon = LonOverlap.append
LatOverlap = []
appendLat = LatOverlap.append

# only one loop
for i in range(len(Lon)):
    if (Lon[i] == Lon2[i]) and (Lat[i] == Lat2[i]):
        MeanAltData = (Alt[i] + Alt2[i]) / 2
        appendAlt(MeanAltData)
        LonOverlapData = Lon[i]
        appendLon(LonOverlapData)
        LatOverlapData = Lat[i]
        appendLat(LatOverlapData)

print(MeanAlt)  # correct ans should be MeanAlt = [-2.0, -5, 1.5]
print(LonOverlap)
print(LatOverlap)
I executed this program 1 million times on my laptop. With my code, all executions take 1.41 seconds in total; with your approach they take 4.01 seconds.
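For anyone who wants to reproduce a timing comparison like this, here is a small sketch using the standard-library timeit module (the statement below times only the single-loop core; the numbers will of course differ from machine to machine):
import timeit

setup = """
Lon = (450000.50, 459000.50, 460000, 470000)
Lat = (5800000.50, 459000.50, 500000, 470000)
Alt = (-1, -9, -2, 1)
Lon2 = (450000.50, 459000.50, 460000, 470000)
Lat2 = (5800000.50, 459000.50, 800000, 470000)
Alt2 = (-3, -1, -20, 2)
"""

stmt = """
MeanAlt = []
for i in range(len(Lon)):
    if Lon[i] == Lon2[i] and Lat[i] == Lat2[i]:
        MeanAlt.append((Alt[i] + Alt2[i]) / 2)
"""

print(timeit.timeit(stmt, setup=setup, number=1_000_000))  # seconds for 1,000,000 runs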

This is not 100% functionally equivalent, but I am guessing it is closer to what you actually want:
Lon = [450000.50, 459000.50, 460000, 470000]
Lat = [5800000.50, 459000.50, 500000, 470000]
Alt = [-1, -9, -2, 1]
Lon2 = [450000.50, 459000.50, 460000, 470000]
Lat2 = [5800000.50, 459000.50, 800000, 470000]
Alt2 = [-3, -1, -20, 2]

MeanAlt = []
appendAlt = MeanAlt.append
LonOverlap = []
appendLon = LonOverlap.append
LatOverlap = []
appendLat = LatOverlap.append

# index the first set by its (lat, lon) pair
ll = dict((str(la) + '/' + str(lo), al) for (la, lo, al) in zip(Lat, Lon, Alt))

for lo, la, al in zip(Lon2, Lat2, Alt2):
    al2 = ll.get(str(la) + '/' + str(lo))
    if al2 is not None:
        MeanAltData = (al + al2) / 2
        appendAlt(MeanAltData)
        LonOverlapData = lo
        appendLon(LonOverlapData)
        LatOverlapData = la
        appendLat(LatOverlapData)

print(MeanAlt)  # correct ans should be MeanAlt = [-2.0, -5, 1.5]
print(LonOverlap)
print(LatOverlap)
Or simpler:
Lon = [450000.50, 459000.50, 460000, 470000]
Lat = [5800000.50, 459000.50, 500000, 470000]
Alt = [-1, -9, -2, 1]
Lon2 = [450000.50, 459000.50, 460000, 470000]
Lat2 = [5800000.50, 459000.50, 800000, 470000]
Alt2 = [-3, -1, -20, 2]

ll = dict((str(la) + '/' + str(lo), al) for (la, lo, al) in zip(Lat, Lon, Alt))

result = []
for lo, la, al in zip(Lon2, Lat2, Alt2):
    al2 = ll.get(str(la) + '/' + str(lo))
    if al2 is not None:
        result.append((la, lo, (al + al2) / 2))
print(result)
In practice, I would try to start with better-structured input data, making the conversion to a dict, or at the very least the zip(), unnecessary.
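As an illustration of that point, here is a minimal sketch of what the lookup looks like when each set is a dict keyed by (lat, lon) tuples (built here from the lists above purely for demonstration), so no string keys have to be assembled:
# dict keyed by (lat, lon) tuples instead of 'lat/lon' strings
ll = {(la, lo): al for la, lo, al in zip(Lat, Lon, Alt)}

result = []
for lo, la, al in zip(Lon2, Lat2, Alt2):
    al2 = ll.get((la, lo))
    if al2 is not None:
        result.append((la, lo, (al + al2) / 2))
print(result)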

Use numpy to vectorize the computations. For 1,000,000-element arrays, execution time should be on the order of 15-25 ms if the inputs are already numpy.ndarrays, and ~140 ms if the inputs are Python lists.
import numpy as np

def mean_alt(lon, lon2, lat, lat2, alt, alt2):
    lon = np.asarray(lon)
    lon2 = np.asarray(lon2)
    lat = np.asarray(lat)
    lat2 = np.asarray(lat2)
    alt = np.asarray(alt)
    alt2 = np.asarray(alt2)
    ind = np.where((lon == lon2) & (lat == lat2))
    mean_alt = (0.5 * (alt[ind] + alt2[ind])).tolist()
    return (lon[ind].tolist(), lat[ind].tolist(), mean_alt)
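A usage sketch with the sample lists from the question (the values in the comments are simply what the function should return for that data):
Lon = [450000.50, 459000.50, 460000, 470000]
Lat = [5800000.50, 459000.50, 500000, 470000]
Alt = [-1, -9, -2, 1]
Lon2 = [450000.50, 459000.50, 460000, 470000]
Lat2 = [5800000.50, 459000.50, 800000, 470000]
Alt2 = [-3, -1, -20, 2]

lons, lats, means = mean_alt(Lon, Lon2, Lat, Lat2, Alt, Alt2)
print(lons)   # [450000.5, 459000.5, 470000.0]
print(lats)   # [5800000.5, 459000.5, 470000.0]
print(means)  # [-2.0, -5.0, 1.5]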

Related

Python inaccurate lat/lon calculations?

I have code that takes a starting lat/lon, a bearing (direction), and a distance (in km) and finds the new lat/lon on a spherical Earth. The code looks like:
import math

def get_new_lat_lon_from_distance_bearing_lat_lon(lat0, lon0, bearing, d):
    # Earth radius in km
    R = 6378.1
    # Convert to radians
    lat1 = math.radians(lat0)
    lon1 = math.radians(lon0)
    brng = math.radians(bearing)
    # Do some math for lat and lon
    lat2 = math.asin(math.sin(lat1) * math.cos(d / R) +
                     math.cos(lat1) * math.sin(d / R) * math.cos(brng))
    lon2 = lon1 + math.atan2(math.sin(brng) * math.sin(d / R) * math.cos(lat1),
                             math.cos(d / R) - math.sin(lat1) * math.sin(lat2))
    # Reconvert to degrees
    lat2 = math.degrees(lat2)
    lon2 = math.degrees(lon2)
    return lat2, lon2
I can then call this such that:
lat_s,lon_s = get_new_lat_lon_from_distance_bearing_lat_lon(yll,xll,180,cellsize*r)
lat_e,lon_e = get_new_lat_lon_from_distance_bearing_lat_lon(yll,xll,90,cellsize*c)
where yll = 55 and xll = -130
I want to step 500 m at a time from this starting lat/lon position, south (bearing = 180) and also east (bearing = 90): 14,000 steps east and 7,000 steps south. In other words, I can loop through these such that:
nrows = 7000
ncols = 14000
cellsize = 0.5  # 0.5 km (500 m)

# Pre-allocate
biglats = []
biglons = []

# Traverse south
for r in range(0, nrows):
    lat_s, lon_s = get_new_lat_lon_from_distance_bearing_lat_lon(yll, xll, 180, cellsize * r)
    biglats.append(lat_s)

# Traverse east
for c in range(0, ncols):
    lat_e, lon_e = get_new_lat_lon_from_distance_bearing_lat_lon(yll, xll, 90, cellsize * c)
    biglons.append(lon_e)
However, when I print the first and last values of each:
55.000000000147004
23.56327426583246
-130.00000000007
-56.372396480687385
23.56 should be 20, and -56.37 should be -60. The end goal would be to create a meshgrid of lat/lon with a [14000,7000] array. However, the calculations are wrong. What could be done to get a more correct lat/lon and/or some sort of 'meshgrid' of 14000,7000 values of lat/lon equally spaced 500m apart given the starting lat/lon provided?
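One possible direction, sketched here under the assumption that pyproj is available (the start values are taken from the printed output above, and all names are illustrative): compute the offsets with the geodesic forward solution on the WGS84 ellipsoid, which accepts arrays, and then build the meshgrid from the resulting vectors:
import numpy as np
from pyproj import Geod

geod = Geod(ellps="WGS84")
yll, xll = 55.0, -130.0      # start lat/lon (assumed from the printed output)
step_m = 500.0               # 0.5 km steps
nrows, ncols = 7000, 14000

south = np.arange(nrows) * step_m   # distances in metres
east = np.arange(ncols) * step_m

# forward problem: start lon/lat + bearing + distance -> end lon/lat
_, biglats, _ = geod.fwd(np.full(nrows, xll), np.full(nrows, yll),
                         np.full(nrows, 180.0), south)
biglons, _, _ = geod.fwd(np.full(ncols, xll), np.full(ncols, yll),
                         np.full(ncols, 90.0), east)

lon_grid, lat_grid = np.meshgrid(biglons, biglats)   # shape (7000, 14000)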

Finding the distance (Haversine) between all elements in a single dataframe

I currently have a dataframe which includes five columns as seen below. I group the elements of the original dataframe such that they are within a 100km x 100km grid. For each grid element, I need to determine whether there is at least one set of points which are 100m away from each other. In order to do this, I am using the Haversine formula and calculating the distance between all points within a grid element using a for loop. This is rather slow as my parent data structure can have billions of points, and each grid element millions. Is there a quicker way to do this?
Here is a view into a group in the dataframe. "approx_LatSp" & "approx_LonSp" are what I use for groupBy in a previous function.
print(group.head())
Time Lat Lon approx_LatSp approx_LonSp
197825 1.144823 -69.552576 -177.213646 -70.0 -177.234835
197826 1.144829 -69.579416 -177.213370 -70.0 -177.234835
197827 1.144834 -69.606256 -177.213102 -70.0 -177.234835
197828 1.144840 -69.633091 -177.212856 -70.0 -177.234835
197829 1.144846 -69.659925 -177.212619 -70.0 -177.234835
This group is equivalent to one grid element. This group gets passed to the following function which seems to be the crux of my issue (from a performance perspective):
def get_pass_in_grid(group):
    '''
    Checks if there are two points within 100m
    '''
    check_100m = 0
    check_1km = 0
    row_mins = []
    for index, row in group.iterrows():
        # Get distance
        distance_from_row = get_distance_lla(row['Lat'], row['Lon'], group['Lat'].drop(index), group['Lon'].drop(index))
        minimum = np.amin(distance_from_row)
        row_mins = row_mins + [minimum]
    array = np.array(row_mins)
    m_100 = array[array < 0.1]
    km_1 = array[array < 1.0]
    if m_100.size > 0:
        check_100m = 1
    if km_1.size > 0:
        check_1km = 1
    return check_100m, check_1km
And the Haversine formula is calculated as follows
def get_distance_lla(row_lat, row_long, group_lat, group_long):
    def radians(degrees):
        return degrees * np.pi / 180.0
    global EARTH_RADIUS
    lon1 = radians(group_long)
    lon2 = radians(row_long)
    lat1 = radians(group_lat)
    lat2 = radians(row_lat)
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    # calculate the result
    return c * EARTH_RADIUS
One way in which I know I can improve this code is to stop the for loop as soon as any two points within 100 m are found. If this is the only way to improve the speed then I will apply it, but I am hoping there is a better way to resolve my problem. Any thoughts are greatly appreciated! Let me know if I can help to clear something up.
Convert all points to Cartesian coordinates to make the task much easier (a distance of 100 m is small enough to disregard that the Earth is not flat).
Divide each grid cell into NxN subgrids (20x20, 100x100? check what is faster), determine which subgrid each point falls into, and compute distances within the smaller subgrids (and their neighbours) instead of searching the whole grid.
Use numpy to vectorize the calculations (doing point no. 1 will definitely help you); a sketch of points 1 and 3 follows below.
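As a hedged sketch of points 1 and 3 (function and variable names below are illustrative, not from the question): project each grid cell's points onto a local Cartesian plane and let scipy's cKDTree return every point's nearest neighbour, instead of the O(n²) Python loop:
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS = 6371.0  # km

def min_neighbour_distance_km(lat_deg, lon_deg):
    # local equirectangular projection; fine at the ~100 km scale of one grid cell
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    x = EARTH_RADIUS * lon * np.cos(lat.mean())
    y = EARTH_RADIUS * lat
    xy = np.column_stack([x, y])
    tree = cKDTree(xy)
    dist, _ = tree.query(xy, k=2)   # k=2 because the closest hit is the point itself
    return dist[:, 1]               # distance (km) to each point's nearest neighbour

# e.g. within get_pass_in_grid:
# d = min_neighbour_distance_km(group['Lat'], group['Lon'])
# check_100m, check_1km = int((d < 0.1).any()), int((d < 1.0).any())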
Thanks @Corralien for the advice. I was able to use a BallTree to quickly find the closest elements. The improvement is something like 100x over my original code from a performance standpoint. Here is the new get_pass_in_grid:
from sklearn.neighbors import BallTree

def get_pass_in_grid(group):
    '''
    Checks if there is a pass within 100m to meet SWE L-Band requirement
    '''
    check_100m = 0
    check_1km = 0
    if len(group) < 2:
        return check_100m, check_1km
    row_mins = []
    group['Lat'] = np.deg2rad(group['Lat'])
    group['Lon'] = np.deg2rad(group['Lon'])
    temp = np.array([group['Lat'], group['Lon']]).T
    tree = BallTree(temp, leaf_size=2, metric='haversine')
    for _, row in group.iterrows():
        # Get distance
        row_arr = np.array([row['Lat'], row['Lon']]).reshape((-1, 2))
        closest_elem_lst, _ = tree.query(row_arr, k=2)
        # First element is always just this one (since d=0)
        closest_elem = closest_elem_lst[0, 1] * EARTH_RADIUS
        row_mins = row_mins + [closest_elem]
        if closest_elem < 0.1:
            break
    array = np.array(row_mins)
    m_100 = array[array < 0.1]
    km_1 = array[array < 1.0]
    if m_100.size > 0:
        check_100m = 1
    if km_1.size > 0:
        check_1km = 1
    return check_100m, check_1km
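A possible further speedup (a sketch only, not verified against the author's data): BallTree.query accepts a whole array of query points, so the per-row loop can be replaced by a single call. The function name and radius constant below are illustrative:
import numpy as np
from sklearn.neighbors import BallTree

def get_pass_in_grid_vectorized(group, earth_radius_km=6371.0):
    # same idea as above, but one query call for the whole group
    if len(group) < 2:
        return 0, 0
    coords = np.deg2rad(group[['Lat', 'Lon']].to_numpy())
    tree = BallTree(coords, metric='haversine')
    dist, _ = tree.query(coords, k=2)          # column 0 is each point itself (distance 0)
    nearest_km = dist[:, 1] * earth_radius_km
    return int((nearest_km < 0.1).any()), int((nearest_km < 1.0).any())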

Finding the minimal distance between two coordinates from different lists

Sorry in advance, currently on mobile!
So I basically have one list with around 50,000 lat/long tuples (list 1) and another one with around 1,800 lat/long tuples (list 2).
What I want to do is the following:
For each of the list elements in list 1, I want to find the closest point out of the list elements in list 2, so that I basically end up with a list of around 50,000 values that represent the minimal distances.
I did not have any issues in calculating the distance for single elements using geopy.distance, however, I am stuck with the for loop implementation and appreciate any help!
Thanks a lot.
from math import sin, cos, sqrt, atan2, radians

def distanceCheck(lat1, lat2, lon1, lon2):
    R = 6373.0
    # convert degrees to radians before using the trig functions
    lat1, lat2, lon1, lon2 = map(radians, (lat1, lat2, lon1, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (sin(dlat / 2))**2 + cos(lat1) * cos(lat2) * (sin(dlon / 2))**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance

distarr = []
for p1 in list1:
    minDist = None
    point = None
    for p2 in list2:
        # DISTANCE CHECK HERE
        check = distanceCheck(p1.lat, p2.lat, p1.lon, p2.lon)
        if minDist is None:
            minDist = check
            point = p2
        else:
            if check < minDist:
                minDist = check
                point = p2
    distarr.append({'min': minDist, 'to': point, 'from': p1})
print("{}".format(distarr))
list1 and list2 are the lists with lat and lon. Hope this helps
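If the nested loop turns out to be too slow for 50,000 × 1,800 pairs, here is a hedged alternative sketch using scikit-learn's BallTree with the haversine metric (assuming list1 and list2 hold objects with .lat and .lon attributes in degrees, as above; names are illustrative):
import numpy as np
from sklearn.neighbors import BallTree

R_KM = 6373.0

# build the tree on the smaller list, query it with the larger one
pts2 = np.radians([[p.lat, p.lon] for p in list2])
pts1 = np.radians([[p.lat, p.lon] for p in list1])

tree = BallTree(pts2, metric='haversine')
dist, idx = tree.query(pts1, k=1)        # great-circle distances in radians

min_dist_km = dist[:, 0] * R_KM          # ~50,000 minimal distances
nearest = [list2[i] for i in idx[:, 0]]  # the matching closest points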

Python: Efficient way to get grid dimensions from lat/lon coordinates

I need a very efficient way to extract the grid dimensions (i.e. the x/y index of a 2d array) of a set of latitude/longitude numpy arrays. In the past I have done this by calculating the great circle distance (using the haversine formula) between all the grid cells and the lat/lon coordinates and then finding the minimum value index (basically looking for the nearest point). Here is a link to a zip file containing the numpy grid arrays for this question. These arrays are originally from a netCDF file that has a curvilinear grid.
Download numpy grid arrays
This works fine, but I need to do this for 300 million points, so this method is going to be far too slow (it would take over a year). Here's my current approach and some other methods I've tried.
import numpy as np
import math
import time
from scipy.spatial import cKDTree
# load the numpy lat/lon grids
lat_array = np.load('../data/lat_array.npy')
lon_array = np.load('../data/lon_array.npy')
# get the array shape dimensions for later conversion
grid_shape = lat_array.shape
# test for lat/lon coordinate
lat = -32
lon = 154
N = 3e8 # how many lat/lon pairs I need to do
Haversine method
# Haversine method
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2.)**2. + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2.)**2.
    c = 2. * math.asin(math.sqrt(a))
    km = 6371. * c  # radius of earth
    return km
# test haversine method
print('\nHaversine results')
tic = time.perf_counter()
# create a list of distance from the point for all grid cells
dist_array = np.asarray([haversine(lon, lat, grid_lon, grid_lat) for grid_lon, grid_lat in zip(lon_array.flatten(), lat_array.flatten())])
# get the index of the minimum value
min_idx = np.argmin(dist_array)
# transform the index back into grid cell dimensions
grid_dims = np.unravel_index(min_idx, grid_shape)
toc = time.perf_counter()
# report results
print('Single iteration time in seconds:', round(toc - tic, 2))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
    print('Results pass checks! :)')
else:
    print('Results FAIL checks :(')
Output
Single iteration time in seconds: 0.13
N iterations time in days: 443.94
Grid coordinate: (179, 136)
Results pass checks! :)
Tunnel Distance
# tunnel distance method
def tunnel_fast(latvar, lonvar, lat0, lon0):
    """
    Find closest point in a set of (lat,lon) points to specified point
    latvar - 2D latitude variable from an open netCDF dataset
    lonvar - 2D longitude variable from an open netCDF dataset
    lat0,lon0 - query point
    Returns iy,ix such that the square of the tunnel distance
    between (latval[iy,ix],lonval[iy,ix]) and (lat0,lon0)
    is minimum.
    """
    rad_factor = math.pi/180.0  # for trigonometry, need angles in radians
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
    ny, nx = latvals.shape
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    # Compute numpy arrays for all values, no loops
    clat, clon = np.cos(latvals), np.cos(lonvals)
    slat, slon = np.sin(latvals), np.sin(lonvals)
    delX = np.cos(lat0_rad)*np.cos(lon0_rad) - clat*clon
    delY = np.cos(lat0_rad)*np.sin(lon0_rad) - clat*slon
    delZ = np.sin(lat0_rad) - slat
    dist_sq = delX**2 + delY**2 + delZ**2
    minindex_1d = dist_sq.argmin()  # 1D index of minimum element
    iy_min, ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return (iy_min, ix_min)
# test tunnel distance method
print('\nTunnel distance results')
tic = time.perf_counter()
# create a list of distance from the point for all grid cells
grid_dims = tunnel_fast(lat_array, lon_array, lat, lon)
toc = time.perf_counter()
# report results
print('Single iteration time in seconds:', round(toc - tic, 5))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
    print('Results pass checks! :)')
else:
    print('Results FAIL checks! :(')
Output
Tunnel distance results
Single iteration time in seconds: 0.00667
N iterations time in days: 23.15
Grid coordinate: (179, 136)
Results pass checks! :)
Alternative Haversine Approach
# alt haversine method
def haversine_numba(s_lat, s_lng, e_lat, e_lng):
    """
    https://towardsdatascience.com/better-parallelization-with-numba-3a41ca69452e
    """
    # approximate radius of earth in km
    R = 6371.0
    s_lat = np.deg2rad(s_lat)
    s_lng = np.deg2rad(s_lng)
    e_lat = np.deg2rad(e_lat)
    e_lng = np.deg2rad(e_lng)
    d = np.sin((e_lat - s_lat)/2)**2 + \
        np.cos(s_lat)*np.cos(e_lat) * \
        np.sin((e_lng - s_lng)/2)**2
    return 2 * R * np.arcsin(np.sqrt(d))

# test haversine numba method
print('\nAlt Numba Haversine results')
tic = time.perf_counter()
# create a list of distance from the point for all grid cells
dist_array = np.asarray([haversine_numba(lon, lat, grid_lon, grid_lat) for grid_lon, grid_lat in zip(lon_array.flatten(), lat_array.flatten())])
# get the index of the minimum value
min_idx = np.argmin(dist_array)
# transform the index back into grid cell dimensions
grid_dims = np.unravel_index(min_idx, grid_shape)
toc = time.perf_counter()
# report results
print('Single iteration time in seconds:', round(toc - tic, 2))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
    print('Results pass checks! :)')
else:
    print('Results FAIL checks :(')
Output
Alt Numba Haversine results
Single iteration time in seconds: 1.26
N iterations time in days: 4364.29
Grid coordinate: (179, 136)
Results pass checks! :)
Kdtree Method
# kdtree method
def kdtree_fast(latvar, lonvar, lat0, lon0):
    """
    Adapted from:
    https://github.com/Unidata/python-workshop/blob/fall-2016/notebooks/netcdf-by-coordinates.ipynb
    """
    rad_factor = math.pi/180.0  # for trigonometry, need angles in radians
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
    ny, nx = latvals.shape
    clat, clon = np.cos(latvals), np.cos(lonvals)
    slat, slon = np.sin(latvals), np.sin(lonvals)
    # Build kd-tree from big arrays of 3D coordinates
    triples = list(zip(np.ravel(clat*clon), np.ravel(clat*slon), np.ravel(slat)))
    kdt = cKDTree(triples)
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    clat0, clon0 = np.cos(lat0_rad), np.cos(lon0_rad)
    slat0, slon0 = np.sin(lat0_rad), np.sin(lon0_rad)
    dist_sq_min, minindex_1d = kdt.query([clat0*clon0, clat0*slon0, slat0])
    iy_min, ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return (iy_min, ix_min)
# test kdtree method
print('\nKD Tree method results')
tic = time.perf_counter()
# create a list of distance from the point for all grid cells
grid_dims = kdtree_fast(lat_array, lon_array, lat, lon)
# get the index of the minimum value
min_idx = np.argmin(dist_array)
# transform the index back into grid cell dimensions
grid_dims = np.unravel_index(min_idx, grid_shape)
toc = time.perf_counter()
# report results
print('Single iteration time in seconds:', round(toc - tic, 2))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
    print('Results pass checks! :)')
else:
    print('Results FAIL checks :(')
Output
KD Tree method results
Single iteration time in seconds: 0.13
N iterations time in days: 438.42
Grid coordinate: (179, 136)
Results pass checks! :)
Kdtree (constructed outside of function)
Approach suggested by max9111
# KD tree method alt method
def kdtree_process(kdt, lat0, lon0):
    """
    Adapted from:
    https://github.com/Unidata/python-workshop/blob/fall-2016/notebooks/netcdf-by-coordinates.ipynb
    """
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    clat0, clon0 = np.cos(lat0_rad), np.cos(lon0_rad)
    slat0, slon0 = np.sin(lat0_rad), np.sin(lon0_rad)
    dist_sq_min, minindex_1d = kdt.query([clat0*clon0, clat0*slon0, slat0])
    iy_min, ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return (iy_min, ix_min)

# produce kd_tree outside of the function
rad_factor = math.pi/180.0  # for trigonometry, need angles in radians
# Read latitude and longitude from file into numpy arrays
latvals = lat_array[:] * rad_factor
lonvals = lon_array[:] * rad_factor
ny, nx = latvals.shape
clat, clon = np.cos(latvals), np.cos(lonvals)
slat, slon = np.sin(latvals), np.sin(lonvals)
# Build kd-tree from big arrays of 3D coordinates
triples = list(zip(np.ravel(clat*clon), np.ravel(clat*slon), np.ravel(slat)))
kdt = cKDTree(triples)
print('\nKD Tree alternative method results')
tic = time.perf_counter()
# create a list of distance from the point for all grid cells
grid_dims = kdtree_process(kdt, lat, lon)
toc = time.perf_counter()
print('Single iteration time in seconds:', round(toc - tic, 5))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
    print('Results pass checks! :)')
else:
    print('Results FAIL checks :(')
Output
KD Tree alternative method results
Single iteration time in seconds: 0.00018
N iterations time in days: 0.63
Grid coordinate: (179, 136)
Results pass checks! :)
Does anyone have a faster way to get the grid cell dimensions or maybe I need to rethink my whole approach?
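One more option that may be worth noting (a sketch only; it reuses kdt, rad_factor and grid_shape from the snippet above, and the query arrays are illustrative): cKDTree.query accepts an (M, 3) array, so large batches of query points can be converted to unit-sphere Cartesian coordinates and looked up in a single call rather than one Python call per point:
# query_lats / query_lons are illustrative arrays of points to look up
query_lats = np.array([-32.0, -31.5])
query_lons = np.array([154.0, 153.5])

lat_r = query_lats * rad_factor
lon_r = query_lons * rad_factor
xyz = np.column_stack([np.cos(lat_r) * np.cos(lon_r),
                       np.cos(lat_r) * np.sin(lon_r),
                       np.sin(lat_r)])

_, idx_1d = kdt.query(xyz)                      # one call for the whole batch
iy, ix = np.unravel_index(idx_1d, grid_shape)   # arrays of grid indices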

Python: Connected components on a sphere

I have been banging my head against this for some time now. My problem is very simple to explain:
I have data containing longitudes and latitudes. For simplicity, let us assume these are coordinates of cities. What I want is to separate these city coordinates into groups, so that all cities within a group lie within a given 'maximum distance' of their nearest neighbour. All cities within a group must have at least one neighbour within this distance limit. The minimum distance between these separated groups is therefore greater than the 'maximum distance' mentioned above.
My understanding is that this is a clustering problem (e.g. minimum spanning tree). The distance on the sphere can be calculated with the haversine distance, but I can't wrap my head around how to implement this... my restrictions are that I can only use numpy, scipy, and scikit-learn.
I hope someone can help
thanks
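Since the restrictions allow scikit-learn, one hedged sketch of this grouping is DBSCAN with the haversine metric and min_samples=1: every point then counts as a core point, so the clusters it returns are exactly the connected components of the 'within maximum distance' graph. The names and the 100 km threshold below are illustrative:
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0
max_distance_km = 100.0   # illustrative threshold

# city coordinates in degrees; the haversine metric expects [lat, lon] in radians
lats = np.random.uniform(-60, 60, 500)
lons = np.random.uniform(-180, 180, 500)
coords = np.radians(np.column_stack([lats, lons]))

db = DBSCAN(eps=max_distance_km / EARTH_RADIUS_KM,
            min_samples=1, metric='haversine').fit(coords)
group = db.labels_        # one integer group label per city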
Ok, so I have implemented a brute force approach to solve this. I am not 100% sure if the results are correct in all cases, though...if some of you have time to check this, it would be greatly appreciated.
import numpy as np
import matplotlib.pyplot as plt

# -------------------------------------------------------------------
def distance_sphere(lon1, lat1, lon2, lat2):
    # Calculate distance on sphere
    return np.degrees(np.arccos(np.sin(np.radians(lat1)) * np.sin(np.radians(lat2)) +
                                np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) *
                                np.cos(np.radians(lon1 - lon2))))

# -------------------------------------------------------------------
def distance_euclid(lon1, lat1, lon2, lat2):
    # Calculate distance
    return np.sqrt((lon1 - lon2)**2 + (lat1 - lat2)**2)

# -------------------------------------------------------------------
# Maximum allowed distance in degrees
max_distance = 10

# Generate city coordinates
lon_all = np.random.random(100) * 100
lat_all = np.random.random(100) * 100

# Start with as many groups as cities
group = np.arange(len(lon_all))

# Loop over all city coordinates
for lon, lat in zip(lon_all, lat_all):
    # Calculate distance to all other cities
    dis = distance_euclid(lon1=lon, lat1=lat, lon2=lon_all, lat2=lat_all)
    # Get index of those which are within the given limits
    idx = np.where(dis <= max_distance)[0]
    # If there is no other city, we continue
    if len(idx) == 0:
        continue
    # Set common group for all cities within the limits
    for i in idx:
        group[group == group[i]] = min(group[idx])

# Rewrite labels starting with 0
for old, new in zip(set(group), range(len(set(group)))):
    idx = [i for i, j in enumerate(group) if j == old]
    group[idx] = new

# -------------------------------------------------------------------
# Plot results
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[10, 10])
for g, lon, lat in zip(group, lon_all, lat_all):
    ax.annotate(str(g), xy=(lon, lat), xycoords="data", size=12, ha="center", va="center")
    circ = plt.Circle((lon, lat), radius=max_distance/2, lw=0, color="gray")
    ax.add_patch(circ)
ax.set_xlim(-10, 110)
ax.set_ylim(-10, 110)
plt.show()
From the graphical output as it stands in your answer, I believe that your clusters are being terminated prematurely. This is my approach to the problem; the code is ugly because really I just wanted to demonstrate the concept and I don't have time to think about the most elegant way to illustrate this. Also, it's not in numpy because then I could steal my old distance calculation function to save me some time. Hopefully the concept though is clear enough and you'll see how it could be made faster and cleaner e.g. not repeatedly rebuilding available_locations and maybe not re-scanning items in the cluster from previous iteration.
Edit: Illustrated behaviour:
1) Always converges on same solution for each DISTANCE_CAP regardless of all the randomisation in the initialisation and progression of the solution
2) Modifying DISTANCE_CAP can result in single-location clusters or a giant blob
import math
from random import choice, shuffle

DISTANCE_CAP = 20

def crow_flies(lat1, lon1, lat2, lon2):
    dx1, dy1 = (lat1/180)*3.141593, (lon1/180)*3.141593
    dx2, dy2 = (lat2/180)*3.141593, (lon2/180)*3.141593
    dlat, dlon = abs(dx2-dx1), abs(dy2-dy1)
    a = (math.sin(dlat/2))**2 + (math.cos(dx1) * math.cos(dx2)
                                 * (math.sin(dlon/2))**2)
    c = 2*(math.atan2(math.sqrt(a), math.sqrt(1-a)))
    km = 6373 * c
    return km

# Aim: separate these back out
manchester = [[53.486286, -2.251476, 1],
              [53.483586, -2.254534, 2],
              [53.475158, -2.248011, 3],
              [53.397161, -2.509189, 4]]
stoke = [[53.037375, -2.262903, 5],
         [53.031031, -2.199587, 6]]
birmingham = [[52.443368, -1.975714, 7],
              [52.429641, -1.902849, 8],
              [52.483326, -1.817483, 9]]

# Mix them all together
combined_list = [item for item in manchester]
for item in stoke:
    combined_list.append(item)
for item in birmingham:
    combined_list.append(item)
shuffle(combined_list)

# Build a matrix:
matrix = {}
for item in combined_list:
    for pair_item in combined_list:
        if item[2] != pair_item[2]:
            distance = crow_flies(item[0], item[1], pair_item[0], pair_item[1])
            matrix[(item[2], pair_item[2])] = distance

# pick a random starting location
available_locations = [combined_list[x][2] for x in range(len(combined_list))]
start_loc = choice(available_locations)
available_locations = [a for a in available_locations if a != start_loc]

all_clusters = []
single_cluster = []
single_cluster.append(start_loc)

# RECURSIVELY add items to our cluster until it cannot get larger, then start a
# new one
cluster_got_bigger = True
while available_locations:
    if cluster_got_bigger == True:
        cluster_got_bigger = False
        for loc in single_cluster:
            for item in available_locations:
                distance = matrix[(loc, item)]
                if distance < DISTANCE_CAP:
                    single_cluster.append(item)
                    available_locations = [a for a in available_locations if a != item]
                    cluster_got_bigger = True
    if cluster_got_bigger == False:
        all_clusters.append(single_cluster)
        single_cluster = []
        new_seed = choice(available_locations)
        single_cluster.append(new_seed)
        available_locations = [a for a in available_locations if a != new_seed]
        cluster_got_bigger = True
    if not available_locations:
        all_clusters.append(single_cluster)

print(all_clusters)
Maybe my answer is too late.
But a quick solution is to construct a graph data structure from your cities and get the connected components of the graph:
Each city is a node.
There is an edge between two cities if their inter-distance is lower than some threshold.
Finally, use a Python network module (e.g. NetworkX).
The code will be something like this:
import networkx as nx

graph = nx.Graph()

# Add all vertices (cities) to the graph
for i, city in enumerate(cities):
    graph.add_node(i)

# Add edges between cities that lie under a distance threshold
for i, city_one in enumerate(cities):
    for j, city_two in enumerate(cities):
        if j > i:
            link_exists = calculate_distance(city_one, city_two) < threshold
            if link_exists:
                graph.add_edge(i, j)

# A list of sets, each set has the indices of cities
components = [c for c in sorted(nx.connected_components(graph), key=len)]
Here, calculate_distance and threshold are assumed to be defined: the first is a distance function and the second is the distance threshold.
