I need a very efficient way to extract the grid indices (i.e. the x/y index into a 2D array) for a set of latitude/longitude points, given numpy arrays of the grid's latitudes and longitudes. In the past I have done this by calculating the great-circle distance (using the haversine formula) between every grid cell and the lat/lon coordinates and then finding the index of the minimum value (basically looking for the nearest point). Here is a link to a zip file containing the numpy grid arrays for this question. These arrays originally come from a netCDF file with a curvilinear grid.
Download numpy grid arrays
This works fine, but I need to do this for 300 million points, so this method is going to be far too slow (it would take over a year). Here's my current approach and some other methods I've tried.
import numpy as np
import math
import time
from scipy.spatial import cKDTree
# load the numpy lat/lon grids
lat_array = np.load('../data/lat_array.npy')
lon_array = np.load('../data/lon_array.npy')
# get the array shape dimensions for later conversion
grid_shape = lat_array.shape
# test for lat/lon coordinate
lat = -32
lon = 154
N = 3e8 # how many lat/lon pairs I need to do
Haversine method
# Haversine method
def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = math.sin(dlat/2.)**2. + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2.)**2.
c = 2. * math.asin(math.sqrt(a))
km = 6371. * c # radius of earth
return km
# test haversine method
print('\nHaversine results')
tic = time.perf_counter()
# create an array of distances from the point to every grid cell
dist_array = np.asarray([haversine(lon, lat, grid_lon, grid_lat) for grid_lon, grid_lat in zip(lon_array.flatten(), lat_array.flatten())])
# get the index of the minimum value
min_idx = np.argmin(dist_array)
# transform the index back into grid cell dimensions
grid_dims = np.unravel_index(min_idx, grid_shape)
toc = time.perf_counter()
# report results
print('Single iteration time in seconds:', round(toc - tic, 2))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
print('Results pass checks! :)')
else:
print('Results FAIL checks :(')
Output
Single iteration time in seconds: 0.13
N iterations time in days: 443.94
Grid coordinate: (179, 136)
Results pass checks! :)
Tunnel Distance
# tunnel distance method
def tunnel_fast(latvar, lonvar, lat0, lon0):
"""
    Find the closest point in a set of (lat,lon) points to a specified point.
    latvar - 2D latitude array (e.g. from an open netCDF dataset)
    lonvar - 2D longitude array (e.g. from an open netCDF dataset)
    lat0,lon0 - query point
    Returns iy,ix such that the square of the tunnel distance
    between (latvals[iy,ix], lonvals[iy,ix]) and (lat0,lon0)
    is minimum.
"""
    rad_factor = math.pi/180.0 # for trigonometry, need angles in radians
    # convert the latitude and longitude arrays to radians
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
ny,nx = latvals.shape
lat0_rad = lat0 * rad_factor
lon0_rad = lon0 * rad_factor
# Compute numpy arrays for all values, no loops
clat,clon = np.cos(latvals), np.cos(lonvals)
slat,slon = np.sin(latvals), np.sin(lonvals)
delX = np.cos(lat0_rad)*np.cos(lon0_rad) - clat*clon
delY = np.cos(lat0_rad)*np.sin(lon0_rad) - clat*slon
    delZ = np.sin(lat0_rad) - slat
dist_sq = delX**2 + delY**2 + delZ**2
minindex_1d = dist_sq.argmin() # 1D index of minimum element
iy_min,ix_min = np.unravel_index(minindex_1d, latvals.shape)
return (iy_min,ix_min)
# test tunnel distance method
print('\nTunnel distance results')
tic = time.perf_counter()
# find the nearest grid cell (the function returns grid indices directly)
grid_dims = tunnel_fast(lat_array, lon_array, lat, lon)
toc = time.perf_counter()
# report results
print('Single iteration time in seconds:', round(toc - tic, 5))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
print('Results pass checks! :)')
else:
print('Results FAIL checks! :(')
Output
Tunnel distance results
Single iteration time in seconds: 0.00667
N iterations time in days: 23.15
Grid coordinate: (179, 136)
Results pass checks! :)
Alternative Haversine Approach
# alt haversine method (note: despite the name, no numba @njit decorator is applied, which likely explains the slow timing below)
def haversine_numba(s_lat, s_lng, e_lat, e_lng):
"""
https://towardsdatascience.com/better-parallelization-with-numba-3a41ca69452e
"""
# approximate radius of earth in km
R = 6371.0
s_lat = np.deg2rad(s_lat)
s_lng = np.deg2rad(s_lng)
e_lat = np.deg2rad(e_lat)
e_lng = np.deg2rad(e_lng)
d = np.sin((e_lat - s_lat)/2)**2 + \
np.cos(s_lat)*np.cos(e_lat) * \
np.sin((e_lng - s_lng)/2)**2
return 2 * R * np.arcsin(np.sqrt(d))
# test haversine numba method
print('\nAlt Numba Haversine results')
tic = time.perf_counter()
# create an array of distances from the point to every grid cell
# (note the argument order: the function signature is lat-first)
dist_array = np.asarray([haversine_numba(lat, lon, grid_lat, grid_lon) for grid_lon, grid_lat in zip(lon_array.flatten(), lat_array.flatten())])
# get the index of the minimum value
min_idx = np.argmin(dist_array)
# transform the index back into grid cell dimensions
grid_dims = np.unravel_index(min_idx, grid_shape)
toc = time.perf_counter()
# report results
print('Single iteration time in seconds:', round(toc - tic, 2))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
print('Results pass checks! :)')
else:
print('Results FAIL checks :(')
Output
Alt Numba Haversine results
Single iteration time in seconds: 1.26
N iterations time in days: 4364.29
Grid coordinate: (179, 136)
Results pass checks! :)
KD-tree Method
# kdtree method
def kdtree_fast(latvar,lonvar,lat0,lon0):
"""
Adapted from:
https://github.com/Unidata/python-workshop/blob/fall-2016/notebooks/netcdf-by-coordinates.ipynb
"""
    rad_factor = math.pi/180.0 # for trigonometry, need angles in radians
    # convert the latitude and longitude arrays to radians
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
ny,nx = latvals.shape
clat,clon = np.cos(latvals), np.cos(lonvals)
slat,slon = np.sin(latvals), np.sin(lonvals)
# Build kd-tree from big arrays of 3D coordinates
triples = list(zip(np.ravel(clat*clon), np.ravel(clat*slon), np.ravel(slat)))
kdt = cKDTree(triples)
lat0_rad = lat0 * rad_factor
lon0_rad = lon0 * rad_factor
clat0,clon0 = np.cos(lat0_rad), np.cos(lon0_rad)
slat0,slon0 = np.sin(lat0_rad), np.sin(lon0_rad)
dist_sq_min, minindex_1d = kdt.query([clat0*clon0, clat0*slon0, slat0])
iy_min, ix_min = np.unravel_index(minindex_1d, latvals.shape)
return (iy_min, ix_min)
# test kdtree method
print('\nKD Tree method results')
tic = time.perf_counter()
# find the nearest grid cell (kdtree_fast returns the grid indices directly)
grid_dims = kdtree_fast(lat_array, lon_array, lat, lon)
toc = time.perf_counter()
# report results
print('Single iteration time in seconds:', round(toc - tic, 2))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
print('Results pass checks! :)')
else:
print('Results FAIL checks :(')
Output
KD Tree method results
Single iteration time in seconds: 0.13
N iterations time in days: 438.42
Grid coordinate: (179, 136)
Results pass checks! :)
KD-tree (constructed outside of the function)
Approach suggested by max9111
# KD tree method alt method
def kdtree_process(kdt,lat0,lon0):
"""
Adapted from:
https://github.com/Unidata/python-workshop/blob/fall-2016/notebooks/netcdf-by-coordinates.ipynb
"""
lat0_rad = lat0 * rad_factor
lon0_rad = lon0 * rad_factor
clat0,clon0 = np.cos(lat0_rad), np.cos(lon0_rad)
slat0,slon0 = np.sin(lat0_rad), np.sin(lon0_rad)
dist_sq_min, minindex_1d = kdt.query([clat0*clon0, clat0*slon0, slat0])
iy_min, ix_min = np.unravel_index(minindex_1d, latvals.shape)
return (iy_min, ix_min)
# produce kd_tree outside of the function
rad_factor = math.pi/180.0 # for trigonometry, need angles in radians
# convert the latitude and longitude arrays to radians
latvals = lat_array[:] * rad_factor
lonvals = lon_array[:] * rad_factor
ny,nx = latvals.shape
clat,clon = np.cos(latvals), np.cos(lonvals)
slat,slon = np.sin(latvals), np.sin(lonvals)
# Build kd-tree from big arrays of 3D coordinates
triples = list(zip(np.ravel(clat*clon), np.ravel(clat*slon), np.ravel(slat)))
kdt = cKDTree(triples)
print('\nKD Tree alternative method results')
tic = time.perf_counter()
# find the nearest grid cell
grid_dims = kdtree_process(kdt, lat, lon)
toc = time.perf_counter()
print('Single iteration time in seconds:', round(toc - tic, 5))
print('N iterations time in days:', round(((toc - tic)*N)/60/60/24, 2))
print('Grid coordinate:', grid_dims)
if (lon_array.flatten()[min_idx] == lon_array[grid_dims[0], grid_dims[1]]) and (lat_array.flatten()[min_idx] == lat_array[grid_dims[0], grid_dims[1]]):
print('Results pass checks! :)')
else:
print('Results FAIL checks :(')
Output
KD Tree alternative method results
Single iteration time in seconds: 0.00018
N iterations time in days: 0.63
Grid coordinate: (179, 136)
Results pass checks! :)
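One further variant I have sketched but not benchmarked: cKDTree.query accepts an (n, 3) array of query points, so the remaining per-call Python overhead can be amortized by querying in large batches (and, on SciPy >= 1.6, parallelized with workers=-1). Here query_lon and query_lat are hypothetical arrays holding my 3e8 points:
import numpy as np
from scipy.spatial import cKDTree

def lonlat_to_xyz(lon, lat):
    # degrees -> 3D unit-sphere coordinates, same mapping as kdtree_process
    lon = np.deg2rad(np.asarray(lon))
    lat = np.deg2rad(np.asarray(lat))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

# build the tree once from the grid
kdt = cKDTree(lonlat_to_xyz(lon_array.ravel(), lat_array.ravel()))

# process the query points in chunks to bound peak memory
chunk = 10_000_000
for start in range(0, query_lon.shape[0], chunk):
    pts = lonlat_to_xyz(query_lon[start:start + chunk],
                        query_lat[start:start + chunk])
    _, idx = kdt.query(pts, workers=-1)  # workers requires SciPy >= 1.6
    iy, ix = np.unravel_index(idx, grid_shape)
    # ... store iy, ix for this chunk ...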
Does anyone have a faster way to get the grid cell indices, or do I need to rethink my whole approach?
Related
I currently have a dataframe with five columns, as seen below. I group the elements of the original dataframe so that they fall within a 100km x 100km grid. For each grid element, I need to determine whether there is at least one pair of points within 100m of each other. To do this, I am using the Haversine formula and calculating the distance between all points within a grid element using a for loop. This is rather slow, as my parent data structure can have billions of points, and each grid element millions. Is there a quicker way to do this?
Here is a view into a group in the dataframe. "approx_LatSp" & "approx_LonSp" are what I use for groupBy in a previous function.
print(group.head())
Time Lat Lon approx_LatSp approx_LonSp
197825 1.144823 -69.552576 -177.213646 -70.0 -177.234835
197826 1.144829 -69.579416 -177.213370 -70.0 -177.234835
197827 1.144834 -69.606256 -177.213102 -70.0 -177.234835
197828 1.144840 -69.633091 -177.212856 -70.0 -177.234835
197829 1.144846 -69.659925 -177.212619 -70.0 -177.234835
This group is equivalent to one grid element. This group gets passed to the following function which seems to be the crux of my issue (from a performance perspective):
def get_pass_in_grid(group):
'''
Checks if there are two points within 100m
'''
check_100m = 0
check_1km = 0
row_mins = []
for index, row in group.iterrows():
# Get distance
distance_from_row = get_distance_lla(row['Lat'], row['Lon'], group['Lat'].drop(index), group['Lon'].drop(index))
minimum = np.amin(distance_from_row)
row_mins = row_mins + [minimum]
array = np.array(row_mins)
m_100 = array[array < 0.1]
km_1 = array[array < 1.0]
if m_100.size > 0:
check_100m = 1
if km_1.size > 0:
check_1km = 1
return check_100m, check_1km
And the Haversine formula is calculated as follows:
def get_distance_lla(row_lat, row_long, group_lat, group_long):
def radians(degrees):
return degrees * np.pi / 180.0
global EARTH_RADIUS
lon1 = radians(group_long)
lon2 = radians(row_long)
lat1 = radians(group_lat)
lat2 = radians(row_lat)
# Haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
c = 2 * np.arcsin(np.sqrt(a))
# calculate the result
return(c * EARTH_RADIUS)
One way I know I can improve this code is to stop the for loop once the 100m condition is met for any two points. If that is the only way to improve the speed, then I will apply it. But I am hoping there is a better way to resolve my problem. Any thoughts are greatly appreciated! Let me know if I can help clear something up.
1. Convert all points to Cartesian coordinates to make the task much easier (a distance of 100m is small enough to disregard that the Earth is not flat).
2. Divide each grid into NxN subgrids (20x20? 100x100? check what is faster), determine for each point which subgrid it is located in, and compute distances within the smaller subgrids (and their neighbours) instead of searching the whole grid.
3. Use numpy to vectorize the calculations (doing point no. 1 will definitely help you) - see the sketch below.
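A minimal sketch of points 1 and 3 (my own illustration, not the poster's code; min_pairwise_km is a hypothetical helper, and the O(n^2) broadcasting is only suitable for modest group sizes - combine it with the subgrids of point 2 for millions of points):
import numpy as np

EARTH_RADIUS = 6371.0088  # km

def min_pairwise_km(lat_deg, lon_deg):
    # point no. 1: local flat-earth projection, fine at ~100 m scales
    lat = np.deg2rad(np.asarray(lat_deg))
    lon = np.deg2rad(np.asarray(lon_deg))
    x = EARTH_RADIUS * lon * np.cos(lat.mean())
    y = EARTH_RADIUS * lat
    pts = np.column_stack((x, y))
    # point no. 3: vectorized all-pairs distances (O(n^2) memory)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    return d.min()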
Thanks @Corralien for the advice. I was able to use a BallTree to quickly find the closest elements. The improvement is roughly 100x over my original code from a performance standpoint. Here is the new get_pass_in_grid:
from sklearn.neighbors import BallTree

def get_pass_in_grid(group):
'''
Checks if there is a pass within 100m to meet SWE L-Band requirement
'''
check_100m = 0
check_1km = 0
if len(group) < 2:
return check_100m, check_1km
row_mins = []
group['Lat'] = np.deg2rad(group['Lat'])
group['Lon'] = np.deg2rad(group['Lon'])
temp = np.array([group['Lat'],group['Lon']]).T
tree = BallTree(temp, leaf_size=2, metric='haversine')
for _, row in group.iterrows():
# Get distance
row_arr = np.array([row['Lat'], row['Lon']]).reshape((-1,2))
closest_elem_lst, _ = tree.query(row_arr, k=2)
# First element is always just this one (since d=0)
closest_elem = closest_elem_lst[0,1] * EARTH_RADIUS
row_mins = row_mins + [closest_elem]
if closest_elem < 0.1:
break
array = np.array(row_mins)
m_100 = array[array < 0.1]
km_1 = array[array < 1.0]
if m_100.size > 0:
check_100m = 1
if km_1.size > 0:
check_1km = 1
return check_100m, check_1km
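As a further note, the iterrows() loop above could likely be dropped entirely, since BallTree.query accepts the whole coordinate array at once. A sketch of that variant (my own, assuming Lat/Lon are still in degrees rather than converted in place as above):
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS = 6371.0088  # km; stands in for the global used above

def get_pass_in_grid_vectorized(group):
    # one bulk query for every point's nearest neighbour
    if len(group) < 2:
        return 0, 0
    pts = np.deg2rad(group[['Lat', 'Lon']].to_numpy())
    tree = BallTree(pts, metric='haversine')
    dist, _ = tree.query(pts, k=2)          # column 0 is each point itself
    nearest_km = dist[:, 1] * EARTH_RADIUS
    return int((nearest_km < 0.1).any()), int((nearest_km < 1.0).any())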
I'm trying to create a function to geographically divide a region into an equally sized grid of 50x50 meter cells. The function needs to return the upper-left and lower-right geographical coordinates of each cell. I'm using numpy:
import numpy as np
upper_right = (33.775353, -111.566165)
lower_right = (33.273915, -111.566165)
upper_left = (33.775353, -112.439578)
lower_left = (33.273915, -112.439578)
cols = np.linspace(lower_left[1], lower_right[1], num=50)
rows = np.linspace(lower_left[0], upper_left[0], num=50)
I don't have any experience with numpy; in fact this is my first time using it. So I'm not sure if linspace is the best method for what I'm trying to do. Some guidance would be very helpful.
Update: I've managed to remove the redundancy by calculating upper_right and lower_left at runtime. Also, I've moved everything to a function that accepts the cell_size (default 50):
import numpy as np
def calculate_grid(upper_left, lower_right, cell_size=50):
upper_right = {'lat': upper_left['lat'], 'lon': lower_right['lon']}
lower_left = {'lat': lower_right['lat'], 'lon': upper_left['lon']}
# cols = np.linspace(lower_left['lon'], lower_right['lon'], num=cell_size)
# rows = np.linspace(lower_left['lat'], upper_left['lat'], num=cell_size)
pass
upper_left = {'lat': 33.775353, 'lon': -112.439578}
lower_right = {'lat': 33.273915, 'lon': -111.566165}
grid = calculate_grid(upper_left, lower_right)
print(grid)
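Not a full answer, but here is a minimal sketch of how the commented-out linspace calls could be sized so each step is roughly 50 m, using the rough flat-earth figure of about 111,320 m per degree of latitude (my assumption; the pyproj answer below is more rigorous):
import numpy as np

def calculate_grid(upper_left, lower_right, cell_size=50):
    # approximate metres per degree (flat-earth assumption)
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = m_per_deg_lat * np.cos(np.deg2rad(upper_left['lat']))
    n_rows = int(abs(upper_left['lat'] - lower_right['lat']) * m_per_deg_lat / cell_size) + 1
    n_cols = int(abs(lower_right['lon'] - upper_left['lon']) * m_per_deg_lon / cell_size) + 1
    rows = np.linspace(upper_left['lat'], lower_right['lat'], num=n_rows)
    cols = np.linspace(upper_left['lon'], lower_right['lon'], num=n_cols)
    # (upper-left, lower-right) corner pair for each cell
    return [[((rows[i], cols[j]), (rows[i + 1], cols[j + 1]))
             for j in range(n_cols - 1)]
            for i in range(n_rows - 1)]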
I took the approach of projecting the region to a local transverse Mercator coordinate system using pyproj. Then I laid out the 50m grid, and finally converted from transverse Mercator back to lat/lon.
import math
import pyproj
import csv
def ll_to_xy(t, lon, lat):
return t.transform(
lon,
lat,
radians=False,
direction=pyproj.enums.TransformDirection.FORWARD)
def xy_to_ll(t, x, y):
lond, latd = t.transform(
x,
y,
radians=False,
direction=pyproj.enums.TransformDirection.INVERSE)
return lond, latd
def generate_cells(xstep, ystep, upper_left, lower_right):
# Transverse mercator coordinate reference system,
# whose origin is in the middle of the region.
lon_0 = upper_left['longitude'] + (lower_right['longitude'] - upper_left['longitude'])/2
lat_0 = lower_right['latitude'] + (upper_left['latitude'] - lower_right['latitude'])/2
geo_crs = pyproj.CRS("EPSG:4326")
tmerc_crs = pyproj.CRS.from_proj4(f'+proj=tmerc +ellps=WGS84 +lon_0={lon_0} +lat_0={lat_0} +units=m +no_defs')
# Lon/lat to tmerc.
ll_to_tmerc = pyproj.Transformer.from_crs(geo_crs, tmerc_crs, always_xy=True)
ul = ll_to_xy(ll_to_tmerc, upper_left['longitude'], upper_left['latitude'])
ur = ll_to_xy(ll_to_tmerc, lower_right['longitude'], upper_left['latitude'])
lr = ll_to_xy(ll_to_tmerc, lower_right['longitude'], lower_right['latitude'])
ll = ll_to_xy(ll_to_tmerc, upper_left['longitude'], lower_right['latitude'])
# Generate a grid, beginning with the upper left point.
grid = []
gx = ul[0]
gy = ul[1]
row = 0
lon, lat = xy_to_ll(ll_to_tmerc, gx, gy)
with open('grid_points.csv', 'w') as gpf:
gpw = csv.writer(gpf)
gpw.writerow(['longitude', 'latitude'])
while lat > lower_right['latitude']:
grid.append([])
while lon <= lower_right['longitude']:
lon, lat = xy_to_ll(ll_to_tmerc, gx, gy)
grid[row].append((lon, lat))
gpw.writerow([lon, lat])
gx += xstep
# Start the next row.
gx = ul[0]
row += 1
gy -= ystep
lon, lat = xy_to_ll(ll_to_tmerc, gx, gy)
# Make cells from the grid points.
cells = []
for i in range(len(grid) - 1):
cells.append([])
for j in range(len(grid[0]) - 1):
cells[i].append({ 'ul' : grid[i][j], 'lr' : grid[i+1][j+1]})
return cells
region = {
'upper_left': {
'latitude': -23.6060507,
'longitude': -46.627016 },
'lower_right': {
'latitude': -23.659132,
'longitude': -46.565758 } }
print('UL', region['upper_left']['longitude'], region['upper_left']['latitude'])
print('LR', region['lower_right']['longitude'], region['lower_right']['latitude'])
cells = generate_cells(50.0, 50.0, region['upper_left'], region['lower_right'])
# Test the cell dimensions with a geodesic to verify
# they are really 50m x 50m.
g = pyproj.Geod(ellps='WGS84')
for (i, row) in enumerate(cells):
for (j, col) in enumerate(row):
# Measure top edge of cell[i][j].
a12, a21, dx = g.inv(
# From upper left corner.
cells[i][j]['ul'][0],
cells[i][j]['ul'][1],
# To upper right corner.
cells[i][j]['lr'][0],
cells[i][j]['ul'][1],
radians=False)
# Measure left edge of cell[i][j].
a12, a21, dy = g.inv(
# From upper left corner.
cells[i][j]['ul'][0],
cells[i][j]['ul'][1],
# to lower left corner.
cells[i][j]['ul'][0],
cells[i][j]['lr'][1],
radians=False)
# Measure diagonal.
a12, a21, dd = g.inv(
# From upper left corner.
cells[i][j]['ul'][0],
cells[i][j]['ul'][1],
# to lower right corner.
cells[i][j]['lr'][0],
cells[i][j]['lr'][1],
radians=False)
h = math.sqrt(dx*dx + dy*dy)
print('cells[{0}][{1}] '.format(i, j))
print(' ul = {0} '.format(cells[i][j]['ul']))
print(' lr = {0} '.format(cells[i][j]['lr']))
print(' dx = {0} dy = {1} dd = {2} h = {3}'.format(dx, dy, dd, h))
if j == 4: break
if i == 4: break
What you want is kind of impossible, at least while keeping the grid lines always oriented north-south and east-west. That is why the grid of the geographical coordinate system does not consist of squares.
I wrote some code for you that creates a grid with a column spacing of approximately 50 m. It will not work near the poles and will have problems with distances crossing -180/+180 longitude.
import math
meanRadius = 6371.0087714150598 # mean radius of the earth in km
colDistance = 0.05 # 50 m great-circle distance between the points of the grid
northWest = (33.775353, -112.439578)
# the latitude angle difference between two points 50 m of great-circle distance apart
angleLat = (colDistance * 360) / (2 * math.pi * meanRadius)
latRadius = math.cos(math.radians(northWest[0])) * meanRadius
angleLon = (colDistance * 360) / (2 * math.pi * latRadius)
print("angleLat: {}".format(angleLat))
print("angleLon: {}".format(angleLon))
print("latRadius: {}".format(latRadius))
grid = []
for x in range(5):
grid += [[]]
for y in range(5):
grid[x] += [(northWest[0] - x * angleLat, northWest[1] + y * angleLon)]
print(grid)
Output
angleLat: 0.00044966018387976883
angleLon: 0.0005409617010418813
latRadius: 5295.733450513392
[[(33.775353, -112.439578), (33.775353, -112.43903703829895), (33.775353, -112.43849607659791), (33.775353, -112.43795511489687), (33.775353, -112.43741415319583)], [(33.77490333981612, -112.439578), (33.77490333981612, -112.43903703829895), (33.77490333981612, -112.43849607659791), (33.77490333981612, -112.43795511489687), (33.77490333981612, -112.43741415319583)], [(33.774453679632245, -112.439578), (33.774453679632245, -112.43903703829895), (33.774453679632245, -112.43849607659791), (33.774453679632245, -112.43795511489687), (33.774453679632245, -112.43741415319583)], [(33.77400401944836, -112.439578), (33.77400401944836, -112.43903703829895), (33.77400401944836, -112.43849607659791), (33.77400401944836, -112.43795511489687), (33.77400401944836, -112.43741415319583)], [(33.77355435926448, -112.439578), (33.77355435926448, -112.43903703829895), (33.77355435926448, -112.43849607659791), (33.77355435926448, -112.43795511489687), (33.77355435926448, -112.43741415319583)]]
Full disclaimer: I have not checked whether the results are plausible; there might still be some errors in the math.
For the math behind it and how the coordinate system works, consult:
https://en.wikipedia.org/wiki/Geographic_coordinate_system
https://en.wikipedia.org/wiki/Great-circle_distance
I have high-frequency GPS data that I want to downsample to one point every 50 meters, i.e. keep the GPS latitude and longitude every 50 meters and discard the points in between. I found Python code on the internet that calculates the distance between two points, but I am not sure how to read the lat and long values from a CSV, feed them into the function, and calculate the distance. If the distance reaches 50 meters, I simply save those GPS coordinates. So far, I have the following Python code:
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
r = 6371 # Radius of earth in kilometers. Use 3956 for miles
return c * r
x1 = 52.19421607
x2 = 52.20000327
y1 = -1.484984011
y2 = -1.48533465
result = haversine(x1,y1,x2,y2) #need to give input from a csv
#if result is greater than 50m , save the coordinates
print(result)
How can I solve this problem? Any direction would be appreciated.
Here is an outline and a working code example, where I made some assumptions about which points to keep/drop. I assume the dataframe is sorted.
1. First calculate the distance to the next point; use haversine for lat/long pairs. This part is not fast in my implementation - you can find faster.
2. Use cumsum() of the distances to create distance groups, where group 1 is all distances below 50, group 2 between 50 and 100, etc.
3. Within each group, keep, for instance, only the first() point.
Note that this is approximately every 50 units based on the group, so be aware this is different from taking a point, jumping to the next point closest to 50 units away, and repeating. But for data-reduction purposes it should be fine.
Generate some random data around London.
import numpy as np
import sklearn
import pandas as pd
LONDON = (51.509865, -0.118092)
random_gps = np.random.random( (10000,2) ) / 25
random_gps[:,0] += np.arange(random_gps.shape[0]) / 25
random_gps[:,0] += LONDON[0]
random_gps[:,1] += LONDON[1]
gps_data = pd.DataFrame( random_gps, columns=["lat","long"] )
Shift the data to get the lat/long of the next point
gps_data['next_lat'] = gps_data.lat.shift(-1)  # shift(-1) so each row sees the following point
gps_data['next_long'] = gps_data.long.shift(-1)
gps_data.head()
Define the distance metric. This part can be improved in terms of speed by using vector expressions with numpy, so if speed is important, change this part (see the sketch at the end of this answer).
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
EARTH_RADIUS = 6371.009
def haversine_distance(row):
point_a = np.array([[row.lat, row.long]])
point_b = np.array([[row.next_lat, row.next_long]])
return EARTH_RADIUS * dist.pairwise(np.radians(point_a), np.radians(point_b) )[0][0]
and apply our distance function (slow part, which can be improved)
gps_data["distance_to_next"] = gps_data.apply( haversine_distance, axis=1)
gps_data["distance_cumsum"] = gps_data.distance_to_next.cumsum()
Finally, create the groups and drop points. Note: the haversine above returns the distance in km, so this example wrongly groups every 50 km instead of 50 m; use 0.05 for 50 m.
gps_data["distance_group"] = gps_data.distance_cumsum // 50
filtered = gps_data.groupby(['distance_group']).first()
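Since the apply step was flagged as the slow part, here is a sketch of a vectorized replacement that operates on the columns directly with numpy expressions:
import numpy as np

EARTH_RADIUS = 6371.009  # km

def haversine_vec(lat1, lon1, lat2, lon2):
    # vectorized haversine over aligned arrays/Series (degrees in, km out)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS * np.arcsin(np.sqrt(a))

# replaces the row-wise apply above in a single call
gps_data['distance_to_next'] = haversine_vec(
    gps_data.lat, gps_data.long, gps_data.next_lat, gps_data.next_long)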
We have a class that has three functions (Bdisk, Bhalo, and BX).
All of these functions accept arrays (e.g. shape (1000,)), not matrices (e.g. shape (2,1000)).
I want to get the total of all these functions (total = Bdisk + Bhalo + BX); together they give the magnetic field in all three components (B_r, B_phi, B_z) for a thousand coordinate points (r, phi, z).
The code is here:
import numpy as np
import logging
import warnings
import gmf
signum = lambda x: (x < 0.) * -1. + (x >= 0) * 1.
pi = np.pi
#Class with analytical functions that describe the GMF according to the model of JF12
class GMF(object):
def __init__(self): # self:is automatically set to reference the newly created object that needs to be initialized
self.Rsun = -8.5 # position of the sun along the x axis in kpc
############################################################################
# Disk Parameters
############################################################################
self.bring, self.bring_unc = 0.1,0.1 # floats, field strength in ring at 3 kpc < r < 5 kpc
self.hdisk, self.hdisk_unc = 0.4, 0.03 # float, disk/halo transition height
self.wdisk, self.wdisk_unc = 0.27,0.08 # floats, transition width
self.b = np.array([0.1,3.,-0.9,-0.8,-2.0,-4.2,0.,2.7]) # (8,1)-dim np.arrays, field strength of spiral arms at 5 kpc
self.b_unc = np.array([1.8,0.6,0.8,0.3,0.1,0.5,1.8,1.8]) # uncertainty
self.rx = np.array([5.1,6.3,7.1,8.3,9.8,11.4,12.7,15.5])# (8,1)-dim np.array,dividing lines of spiral lines coordinates of neg. x-axes that intersect with arm
self.idisk = 11.5 * pi/180. # float, spiral arms pitch angle
#############################################################################
# Halo Parameters
#############################################################################
self.Bn, self.Bn_unc = 1.4,0.1 # floats, field strength northern halo
self.Bs, self.Bs_unc = -1.1,0.1 # floats, field strength southern halo
self.rn, self.rn_unc = 9.22,0.08 # floats, transition radius south, lower limit
self.rs, self.rs_unc = 16.7,0. # transition radius south, lower limit
self.whalo, self.whalo_unc = 0.2,0.12 # floats, transition width
self.z0, self.z0_unc = 5.3, 1.6 # floats, vertical scale height
##############################################################################
        # Out of plane or "X" component Parameters
##############################################################################
self.BX0, self.BX_unc = 4.6,0.3 # floats, field strength at origin
self.ThetaX0, self.ThetaX0_unc = 49. * pi/180., pi/180. # elev. angle at z = 0, r > rXc
self.rXc, self.rXc_unc = 4.8, 0.2 # floats, radius where thetaX = thetaX0
self.rX, self.rX_unc = 2.9, 0.1 # floats, exponential scale length
# striated field
self.gamma, self.gamma_unc = 2.92,0.14 # striation and/or rel. elec. number dens. rescaling
return
##################################################################################
##################################################################################
# Transition function given by logistic function eq.5
##################################################################################
def L(self,z,h,w):
if np.isscalar(z):
z = np.array([z]) # scalar or numpy array with positions (height above disk, z; distance from center, r)
ones = np.ones(z.shape[0])
return 1./(ones + np.exp(-2. *(np.abs(z)- h)/w))
####################################################################################
# return distance from center for angle phi of logarithmic spiral
# r(phi) = rx * exp(b * phi) as np.array
####################################################################################
def r_log_spiral(self,phi):
if np.isscalar(phi): #Returns True if the type of num is a scalar type.
phi = np.array([phi])
ones = np.ones(phi.shape[0])
# self.rx.shape = 8
# phi.shape = p
# then result is given as (8,p)-dim array, each row stands for one rx
# vstack : Take a sequence of arrays and stack them vertically to make a single array
# tensordot(a, b, axes=2):Compute tensor dot product along specified axes for arrays >=1D.
result = np.tensordot(self.rx , np.exp((phi - 3.*pi*ones) / np.tan(pi/2. - self.idisk)),axes = 0)
result = np.vstack((result, np.tensordot(self.rx , np.exp((phi - pi*ones) / np.tan(pi/2. - self.idisk)),axes = 0) ))
result = np.vstack((result, np.tensordot(self.rx , np.exp((phi + pi*ones) / np.tan(pi/2. - self.idisk)),axes = 0) ))
return np.vstack((result, np.tensordot(self.rx , np.exp((phi + 3.*pi*ones) / np.tan(pi/2. - self.idisk)),axes = 0) ))
#############################################################################################
# Disk component in galactocentric cylindrical coordinates (r,phi,z)
#############################################################################################
def Bdisk(self,r,phi,z):
# Bdisk is purely azimuthal (toroidal) with the field strength b_ring
"""
r: N-dim np.array, distance from origin in GC cylindrical coordinates, is in kpc
z: N-dim np.array, height in kpc in GC cylindrical coordinates
phi:N-dim np.array, polar angle in GC cylindircal coordinates, in radian
Bdisk: (3,N)-dim np.array with (r,phi,z) components of disk field for each coordinate tuple
|Bdisk|: N-dim np.array, absolute value of Bdisk for each coordinate tuple
"""
        if not (r.shape[0] == phi.shape[0] == z.shape[0]):
warnings.warn("List do not have equal shape! returning -1", RuntimeWarning)
return -1
# Return a new array of given shape and type, filled with zeros.
Bdisk = np.zeros((3,r.shape[0])) # Bdisk vector in r, phi, z
ones = np.ones(r.shape[0])
r_center = (r >= 3.) & (r < 5.1)
r_disk = (r >= 5.1) & (r <= 20.)
Bdisk[1,r_center] = self.bring
# Determine in which arm we are
# this is done for each coordinate individually
if np.sum(r_disk):
rls = self.r_log_spiral(phi[r_disk])
rls = np.abs(rls - r[r_disk])
arms = np.argmin(rls, axis = 0) % 8
            # the magnetic spiral is defined at r = 5 kpc and falls off as 1/r; the field direction is given by:
Bdisk[0,r_disk] = np.sin(self.idisk)* self.b[arms] * (5. / r[r_disk])
Bdisk[1,r_disk] = np.cos(self.idisk)* self.b[arms] * (5. / r[r_disk])
Bdisk *= (ones - self.L(z,self.hdisk,self.wdisk)) # multiplied by L
return Bdisk, np.sqrt(np.sum(Bdisk**2.,axis = 0)) # the Bdisk, the normalization
# axis=0 : sum over index 0(row)
# axis=1 : sum over index 1(columns)
##############################################################################################
# Halo component
###############################################################################################
def Bhalo(self,r,z):
# Bhalo is purely azimuthal (toroidal), i.e. has only a phi component
if (not r.shape[0] == z.shape[0]):
warnings.warn("List do not have equal shape! returning -1", RuntimeWarning)
return -1
Bhalo = np.zeros((3,r.shape[0])) # Bhalo vector in r, phi, z rows: r, phi and z component
ones = np.ones(r.shape[0])
m = ( z != 0. )
# SEE equation 6.
Bhalo[1,m] = np.exp(-np.abs(z[m])/self.z0) * self.L(z[m], self.hdisk, self.wdisk) * \
( self.Bn * (ones[m] - self.L(r[m], self.rn, self.whalo)) * (z[m] > 0.) \
+ self.Bs * (ones[m] - self.L(r[m], self.rs, self.whalo)) * (z[m] < 0.) )
return Bhalo , np.sqrt(np.sum(Bhalo**2.,axis = 0))
##############################################################################################
# BX component (OUT OF THE PLANE)
###############################################################################################
def BX(self,r,z):
#BX is purely ASS and poloidal, i.e. phi component = 0
if (not r.shape[0] == z.shape[0]):
warnings.warn("List do not have equal shape! returning -1", RuntimeWarning)
return -1
BX= np.zeros((3,r.shape[0])) # BX vector in r, phi, z rows: r, phi and z component
m = np.sqrt(r**2. + z**2.) >= 1.
bx = lambda r_p: self.BX0 * np.exp(-r_p / self.rX) # eq.7
thetaX = lambda r,z,r_p: np.arctan(np.abs(z)/(r - r_p)) # eq.10
r_p = r[m] *self.rXc/(self.rXc + np.abs(z[m] ) / np.tan(self.ThetaX0)) # eq 9
m_r_b = r_p > self.rXc # region with constant elevation angle
m_r_l = r_p <= self.rXc # region with varying elevation angle
theta = np.zeros(z[m].shape[0])
b = np.zeros(z[m].shape[0])
r_p0 = (r[m])[m_r_b] - np.abs( (z[m])[m_r_b] ) / np.tan(self.ThetaX0) # eq.8
b[m_r_b] = bx(r_p0) * r_p0/ (r[m])[m_r_b] # the field strength in the constant elevation angle (b_x(r_p)r_p/r)
theta[m_r_b] = self.ThetaX0 * np.ones(theta.shape[0])[m_r_b]
b[m_r_l] = bx(r_p[m_r_l]) * (r_p[m_r_l]/(r[m])[m_r_l] )**2. # the field strength with varying elevation angle (b_x(r_p)(r_p/r)**2)
theta[m_r_l] = thetaX((r[m])[m_r_l] ,(z[m])[m_r_l] ,r_p[m_r_l])
mz = (z[m] == 0.)
theta[mz] = np.pi/2.
BX[0,m] = b * (np.cos(theta) * (z[m] >= 0) + np.cos(pi*np.ones(theta.shape[0]) - theta) * (z[m] < 0))
BX[2,m] = b * (np.sin(theta) * (z[m] >= 0) + np.sin(pi*np.ones(theta.shape[0]) - theta) * (z[m] < 0))
return BX, np.sqrt(np.sum(BX**2.,axis=0))
Then I create three arrays: one for r, one for phi, one for z. Each of these arrays has, e.g., a thousand elements, like this:
import gmf
gmfm = gmf.GMF()
x = np.linspace(-20.,20.,100)
y = np.linspace(-20.,20.,100)
z = np.linspace(-1.,1.,x.shape[0])
xx,yy = np.meshgrid(x,y)
rr = np.sqrt(xx**2. + yy**2.)
theta = np.arctan2(yy,xx)
for i,r in enumerate(rr[:]):
Bdisk, Babs_d = gmfm.Bdisk(r,theta[i],z)
Bhalo, Babs_h = gmfm.Bhalo(r,z)
BX, Babs_x = gmfm.BX(r,z)
Btotal = Bdisk + Bhalo + BX
but when I do the addition of the three functions (Btotal = Bdisk + Bhalo + BX) I get a 2D matrix with 3 rows and 100 columns.
My question is: how can I add these three functions together to get Btotal in shape (n,), e.g. shape (100,)?
Because, as I said at the beginning, the three functions accept arrays (e.g. shape (1000,)), so when we add the three functions together we should get the total in the same shape (n,).
I do not know how to do it; could you please tell me how?
Thank you for your cooperation.
You need to correct the indentation, for example in the Bdisk method.
More importantly, in
for i,r in enumerate(rr[:]):
Bdisk, Babs_d = gmfm.Bdisk(r,theta[i],z)
Bhalo, Babs_h = gmfm.Bhalo(r,z)
BX, Babs_x = gmfm.BX(r,z)
Btotal = Bdisk + Bhalo + BX
are you doing this addition for each iteration, or once at the end of the loop? You aren't accumulating any values over the iterations; you are just throwing away the old ones, leaving you with the result of the final iteration.
As for adding the arrays - it appears that all your arrays are initialized like:
Bdisk = np.zeros((3,r.shape[0]))
If that's what the method returns, then
Bdisk + Bhalo + BX
will just sum the corresponding elements of each array, resulting in a Btotal with the same shape. If you don't like the shape of Btotal, then change how Bdisk is calculated, because Btotal has the same shape.
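If the goal is a shape-(n,) result per grid row, one possible sketch (my reading of the intent, using the variables from your loop; not the only option) is to accumulate each row's total and reduce over the component axis:
import numpy as np

# keep each row's result and reduce the (3, N) field to shape (N,)
Babs_rows = []
for i, r in enumerate(rr):
    Bdisk, _ = gmfm.Bdisk(r, theta[i], z)
    Bhalo, _ = gmfm.Bhalo(r, z)
    BX, _ = gmfm.BX(r, z)
    Btotal = Bdisk + Bhalo + BX                         # (3, N) components
    Babs_rows.append(np.sqrt((Btotal**2).sum(axis=0)))  # magnitude, (N,)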
I have been having trouble finding a solution to my problem.
I have a very large CSV file containing multiple points. I have created two different functions that compute distance and speed.
What I need is a way to apply these functions between the first and second point, then the second and third point, and so on. I have been toying with arrays and numpy, but I cannot seem to figure it out.
Here are my functions:
# distance
def haversine_distance(lat1, long1, lat2, long2):
degrees_to_radians = math.pi/180.0
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - lat2)*degrees_to_radians
theta1 = long1 * degrees_to_radians
theta2 = long2 * degrees_to_radians
cos = (math.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) +
math.cos(phi1)*math.cos(phi2))
arc = math.acos(cos) * 6371
arc = arc * 1000
return arc
# speed
def speed(lat1, long1, time1, lat2, long2, time2):
distance = haversine_distance(lat1, long1, lat2, long2)
delta_time = time2 - time1
speed = (distance / delta_time)
speed = speed * 3.6
return speed
I assume you can read your CSV data into numpy arrays; let's call them capital Lat, Long and Time. Then all you need to do is call your functions on the appropriately shifted points (note that haversine_distance must use numpy instead of math for this to work over arrays - see the sketch after the code):
# initialize correct vectors
l1=Lat[:-1] # all points but last
l2=Lat[1:] # all points but first
lg1=Long[:-1]
lg2=Long[1:]
t1=Time[:-1]
t2=Time[1:]
speed(l1,lg1,t1,l2,lg2,t2) # call the function which will run over your arrays
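As noted above, here is a sketch of a numpy version of haversine_distance so the vectorized call actually works (the math-module version in the question accepts only scalars):
import numpy as np

def haversine_distance(lat1, long1, lat2, long2):
    # numpy rewrite of the math-based version above; broadcasts over arrays
    degrees_to_radians = np.pi / 180.0
    phi1 = (90.0 - lat1) * degrees_to_radians
    phi2 = (90.0 - lat2) * degrees_to_radians
    theta1 = long1 * degrees_to_radians
    theta2 = long2 * degrees_to_radians
    cos = (np.sin(phi1) * np.sin(phi2) * np.cos(theta1 - theta2)
           + np.cos(phi1) * np.cos(phi2))
    # clip guards against floating-point values slightly outside [-1, 1]
    arc = np.arccos(np.clip(cos, -1.0, 1.0)) * 6371
    return arc * 1000  # metres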