Distance Matrix Haversine - python

I am working on a data frame that looks like this :
lat lon
id_zone
0 40.0795 4.338600
1 45.9990 4.829600
2 45.2729 2.882000
3 45.7336 4.850478
4 45.6981 5.043200
I'm trying to build a Haversine distance matrix. Basically, for each zone, I would like to calculate the distance between it and all the other zones in the dataframe, so there should be only 0s on the diagonal. Here is the Haversine function that I use, but I can't build the matrix with it.
def haversine(x):
    x.lon, x.lat, x.lon2, x.lat2 = map(radians, [x.lon, x.lat, x.lon2, x.lat2])
    # Haversine formula
    dlon = x.lon2 - x.lon
    dlat = x.lat2 - x.lat
    a = sin(dlat / 2) ** 2 + cos(x.lat) * cos(x.lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    km = 6367 * c
    return km

You can use the solution from this answer: Pandas - Creating Difference Matrix from Data Frame.
Or in your specific case, where you have a DataFrame like this example:
lat lon
id_zone
0 40.0795 4.338600
1 45.9990 4.829600
2 45.2729 2.882000
3 45.7336 4.850478
4 45.6981 5.043200
And your function is defined as:
import numpy as np

def haversine(first, second):
    # convert decimal degrees to radians (rows are indexed by column name)
    lat, lon, lat2, lon2 = map(np.radians, [first["lat"], first["lon"], second["lat"], second["lon"]])
    # haversine formula
    dlon = lon2 - lon
    dlat = lat2 - lat
    a = np.sin(dlat/2)**2 + np.cos(lat) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
Here first and second are rows of the DataFrame holding the lat and lon of the two locations.
You can then create a matrix of zeros with NumPy and fill it with the distances returned by the haversine function:
# create a matrix for the distances between each pair of zones
distances = np.zeros((len(df), len(df)))
for i in range(len(df)):
    for j in range(len(df)):
        distances[i, j] = haversine(df.iloc[i], df.iloc[j])
pd.DataFrame(distances, index=df.index, columns=df.index)
Your output should be similar to this:
id_zone 0 1 2 3 4
id_zone
0 0.000000 659.422944 589.599339 630.083979 627.383858
1 659.422944 0.000000 171.597296 29.555376 37.325316
2 589.599339 171.597296 0.000000 161.731366 174.983855
3 630.083979 29.555376 161.731366 0.000000 15.474533
4 627.383858 37.325316 174.983855 15.474533 0.000000
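For larger frames, the double loop above can be replaced by NumPy broadcasting. The following is a minimal sketch, not part of the original answer: the helper name haversine_matrix and its radius_km parameter are made up for illustration, and the frame is rebuilt from the sample data in the question.

```python
import numpy as np
import pandas as pd


def haversine_matrix(lat_deg, lon_deg, radius_km=6371.0):
    """Pairwise great-circle distances in km (hypothetical helper)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # A column vector minus a row vector broadcasts to an (n, n) grid of differences.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clipping guards against rounding pushing a marginally outside [0, 1].
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


df = pd.DataFrame(
    {"lat": [40.0795, 45.9990, 45.2729, 45.7336, 45.6981],
     "lon": [4.338600, 4.829600, 2.882000, 4.850478, 5.043200]},
    index=pd.Index(range(5), name="id_zone"),
)
matrix = pd.DataFrame(haversine_matrix(df["lat"], df["lon"]),
                      index=df.index, columns=df.index)
print(matrix.round(3))
```

This trades memory for speed: the intermediate arrays hold one entry per pair of zones, so it suits a few thousand zones rather than millions.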

Related

Calculate distance between consecutive GPS points and reduce GPS density based on this distance

I have a pandas dataframe that represents the GPS trajectory of a vehicle
d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467, 4.924583, 4.922133, 4.921400, 4.920967], 'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282, 52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)
id longitude latitude
1 4.929783 52.372250
2 4.932333 52.370884
3 4.933950 52.371101
4 4.933900 52.372234
5 4.928467 52.375282
6 4.924583 52.375950
7 4.922133 52.376301
8 4.921400 52.376232
9 4.920967 52.374481
I already calculated the (haversine) distance in meters between consecutive GPS points as follows:
import numpy as np
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m
df1['distance'] = haversine(df1['latitude'], df1['longitude'],
                            df1['latitude'].shift(), df1['longitude'].shift())
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
3 4.933950 52.371101 112.398101
4 4.933900 52.372234 126.029572
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
8 4.921400 52.376232 50.345227
9 4.920967 52.374481 196.908503
Now I would like to create a function that:
- removes the second (i.e. the following) point if the distance between consecutive GPS points is less than 150 meters;
- always keeps the first and the last GPS point, regardless of the distance to the previously kept point.
Meaning this should be the output:
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
9 4.920967 52.374481 196.908503
What is the best way to achieve this in python?
NOTE: This doesn't account for maximum distance... that would require some look ahead and optimization.
I would iterate through and pass back just the index values of the rows you'd like to keep. Use those index values in a loc call.
Distance
Use whatever metric you want. I used OP's haversine distance.
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m
def dis(t0, t1):
    lat0 = t0.latitude
    lon0 = t0.longitude
    lat1 = t1.latitude
    lon1 = t1.longitude
    return haversine(lat0, lon0, lat1, lon1)
The Loop
def f(d, threshold=50):
    itups = d.itertuples()
    last = next(itups)
    indices = [last.Index]
    distances = [0]
    for tup in itups:
        distance = dis(tup, last)
        if distance > threshold:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup
    return indices, distances
The Results
idx, distances = f(df1, 150)
df1.loc[idx].assign(distance=distances)
id longitude latitude distance
0 1 4.929783 52.372250 0.000000
1 2 4.932333 52.370884 230.305288
3 4 4.933900 52.372234 183.986479
4 5 4.928467 52.375282 500.896578
5 6 4.924583 52.375950 273.918990
6 7 4.922133 52.376301 170.828592
8 9 4.920967 52.374481 217.302775
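Note that the loop above keeps the last point here only because it happens to lie more than 150 m from the previously kept one. The question asks for the last point to be kept unconditionally, which could be handled as in the sketch below. It is not part of the original answer: the names haversine_m and thin_track are hypothetical, and the function assumes a non-empty frame with latitude and longitude columns.

```python
import numpy as np
import pandas as pd


def haversine_m(lat1, lon1, lat2, lon2, earth_radius_km=6371):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius_km * 2 * np.arcsin(np.sqrt(a)) * 1000


def thin_track(d, threshold=150):
    """Index labels to keep; assumes d is non-empty (hypothetical helper)."""
    rows = list(d.itertuples())
    last = rows[0]
    indices = [last.Index]
    for tup in rows[1:]:
        # Distance is measured from the last kept point, as in the loop above.
        if haversine_m(last.latitude, last.longitude,
                       tup.latitude, tup.longitude) > threshold:
            indices.append(tup.Index)
            last = tup
    # Force the final point into the result even if it is closer than the threshold.
    if indices[-1] != rows[-1].Index:
        indices.append(rows[-1].Index)
    return indices


df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467,
                  4.924583, 4.922133, 4.921400, 4.920967],
    'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282,
                 52.375950, 52.376301, 52.376232, 52.374481],
})
kept = df1.loc[thin_track(df1, 150)]
print(kept)
```

On the sample track this selects the same rows as the result above; the difference only shows when the final point is within the threshold of the last kept point.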

How to search a certain value in series python

I've got a series:p
0 353.267439
1 388.483605
2 0.494685
3 1.347499
4 404.202001
5 6.163468
6 29.782820
7 28.972926
8 2.822725
9 0.000000
10 1.309716
11 1.309716
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 63.199779
17 62.669258
18 0.306850
19 0.000000
20 28.218308
21 32.078732
22 4.394789
23 0.995053
24 236.355502
25 172.802915
26 1.207798
27 0.174134
28 0.706518
29 0.922744
1666374 0.000000
1666375 0.000000
1666376 0.000000
1666377 0.000000
1666378 0.033375
1666379 0.033375
1666380 0.118138
1666381 0.118138
1666382 12.415525
1666383 12.415525
1666384 24.252089
1666385 0.270588
1666386 24.292072
1666387 12.415525
1666388 12.415525
1666389 0.000000
1666390 0.000000
1666391 0.000000
1666392 0.118138
1666393 0.118138
1666394 0.118138
1666395 0.000000
1666396 0.000000
1666397 0.000000
1666398 0.000000
1666399 0.000000
1666400 0.118138
1666401 0.000000
1666402 0.118138
1666403 0.118138
Name: Dis, Length: 1666404, dtype: float64
and I believe it contains the value 4.74036126519e-07.
I tried a few ways to find that value:
p[p == 'value']
or a function:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None
but they return nothing.
Strangely, when I call
p[p == 0]
it does return the matching indices.
Why is that, and how do I search for a value in a series properly?
Code:
def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
def DisM(df,ID):
    df_user=df.loc[df['UserID'] == ID]
    p= haversine_np(df_user.Longitude.shift(), df_user.Latitude.shift(), df_user.ix[1:, 'Longitude'], df_user.ix[1:, 'Latitude'])
    p=p.iloc[1:]
    p=p.rename("Dis")
    return (p)
p = DisM(df,1)
for num in np.arange(2,4861):
    p= p.append(DisM(df,num))
p=p.reset_index(drop=True)
df is a dataframe containing users' location information (longitude, latitude).
I use the haversine formula to compute the distances between the points of each user's trips,
then a for loop to append the distances together :p
Actually, the number I'm trying to find is not so important. I can't get a result when searching for other values in the series either, such as 353.267439 (the first element).
This adds rounding to your checking function:
def find(s, el, n):
    for i in range(len(s)):
        if round(s[i],n) == round(el,n):
            return i
    return None
n is the number of decimal places both values are rounded to before they are compared.
You can test it using a simple script like this one:
series = []
with open('series.txt','r') as f:
    for line in f:
        series.append(line.strip().split())
res = [float(x[1]) for x in series]
check = [353.267,0.706518,24.292]
print([find(res, x, 3) for x in check])
# yields [0, 28, 42]
Here series.txt is a text file with the data you posted (with the one empty line removed). The script prints the correct indexes. It mimics the situation where both sides are rounded to 3 decimals, which is the precision of the values in check, except for the middle element, which is given to 6 decimals.
Similarly, it will work if the values in check have some trailing digits:
check = [353.2671111,0.7065181111,24.292111]
print([find(res, x, 3) for x in check])
# yields [0, 28, 42]
But it will not work, except for the value given at full precision, if you round to more decimals than the least precise input has:
check = [353.267,0.706518,24.292]
print([find(res, x, 7) for x in check])
# yields [None, 28, None]
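An alternative to rounding inside a Python loop is a vectorized tolerance comparison with np.isclose. The sketch below is not from the original answer; the series values and the helper name find_close are made up for illustration. It assumes the stored floats carry more digits than the six decimals pandas prints, which would explain why an exact == comparison against the printed number finds nothing.

```python
import numpy as np
import pandas as pd

# Made-up values: the first one has more digits than the six decimals pandas displays.
p = pd.Series([353.26743912, 388.483605, 0.494685, 4.74036126519e-07], name="Dis")


def find_close(s, value, atol=1e-6):
    """Return the index labels whose values lie within atol of value (hypothetical helper)."""
    mask = np.isclose(s.to_numpy(), value, rtol=0.0, atol=atol)
    return s.index[mask].tolist()


exact = p[p == 353.267439]          # exact comparison against the printed number
close = find_close(p, 353.267439)   # comparison within an absolute tolerance
tiny = find_close(p, 4.74036126519e-07, atol=1e-12)
print(len(exact), close, tiny)
```

The tolerance has to be chosen relative to the magnitude of the value searched for: with the default atol of 1e-6, a search for a number as small as 4.74e-07 would also match every 0.0 in the series, hence the tighter atol in the last call.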

Efficient computation of minimum of Haversine distances

I have a dataframe with >2.7MM coordinates, and a separate list of ~2,000 coordinates. I'm trying to return the minimum distance between the coordinates in each individual row compared to every coordinate in the list. The following code works on a small scale (dataframe with 200 rows), but when calculating over 2.7MM rows, it seemingly runs forever.
from haversine import haversine
df
Latitude Longitude
39.989 -89.980
39.923 -89.901
39.990 -89.987
39.884 -89.943
39.030 -89.931
end_coords_list = [(41.342,-90.423),(40.349,-91.394),(38.928,-89.323)]
for row in df.itertuples():
    def min_distance(row):
        beg_coord = (row.Latitude, row.Longitude)
        return min(haversine(beg_coord, end_coord) for end_coord in end_coords_list)
    df['Min_Distance'] = df.apply(min_distance, axis=1)
I know the issue lies in the sheer number of calculations that are happening (2.7MM * 2,000 = ~5.4BN), and the fact that running this many loops is incredibly inefficient.
Based on my research, it seems like a vectorized NumPy function might be a better approach, but I'm new to Python and NumPy so I'm not quite sure how to implement this in this particular situation.
Ideal Output:
df
Latitude Longitude Min_Distance
39.989 -89.980 3.7
39.923 -89.901 4.1
39.990 -89.987 4.2
39.884 -89.943 5.9
39.030 -89.931 3.1
Thanks in advance!
The haversine func in essence is :
# convert all latitudes/longitudes from decimal degrees to radians
lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
# calculate haversine
lat = lat2 - lat1
lng = lng2 - lng1
d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
h = 2 * AVG_EARTH_RADIUS * asin(sqrt(d))
Here's a vectorized method that uses NumPy broadcasting and NumPy ufuncs in place of those math-module functions, so that we operate on entire arrays in one go:
# Get array data; convert to radians to simulate 'map(radians,...)' part
coords_arr = np.deg2rad(coords_list)
a = np.deg2rad(df.values)
# Get the differences
lat = coords_arr[:,0] - a[:,0,None]
lng = coords_arr[:,1] - a[:,1,None]
# Compute the "cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2" part.
# Add into "sin(lat * 0.5) ** 2" part.
add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
d = np.sin(lat * 0.5) ** 2 + add0
# Get h and assign into dataframe
h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
df['Min_Distance'] = h.min(1)
For a further performance boost, we can use the numexpr module to evaluate the transcendental functions.
Runtime test and verification
Approaches -
def loopy_app(df, coords_list):
    for row in df.itertuples():
        df['Min_Distance1'] = df.apply(min_distance, axis=1)
def vectorized_app(df, coords_list):
    coords_arr = np.deg2rad(coords_list)
    a = np.deg2rad(df.values)
    lat = coords_arr[:,0] - a[:,0,None]
    lng = coords_arr[:,1] - a[:,1,None]
    add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
    d = np.sin(lat * 0.5) ** 2 + add0
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    df['Min_Distance2'] = h.min(1)
Verification -
In [158]: df
Out[158]:
Latitude Longitude
0 39.989 -89.980
1 39.923 -89.901
2 39.990 -89.987
3 39.884 -89.943
4 39.030 -89.931
In [159]: loopy_app(df, coords_list)
In [160]: vectorized_app(df, coords_list)
In [161]: df
Out[161]:
Latitude Longitude Min_Distance1 Min_Distance2
0 39.989 -89.980 126.637607 126.637607
1 39.923 -89.901 121.266241 121.266241
2 39.990 -89.987 126.037388 126.037388
3 39.884 -89.943 118.901195 118.901195
4 39.030 -89.931 53.765506 53.765506
Timings -
In [163]: df
Out[163]:
Latitude Longitude
0 39.989 -89.980
1 39.923 -89.901
2 39.990 -89.987
3 39.884 -89.943
4 39.030 -89.931
In [164]: %timeit loopy_app(df, coords_list)
100 loops, best of 3: 2.41 ms per loop
In [165]: %timeit vectorized_app(df, coords_list)
10000 loops, best of 3: 96.8 µs per loop
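One caveat for the sizes in the question: the broadcast arrays above hold one entry per (row, coordinate) pair, so 2.7MM rows against ~2,000 coordinates would need several billion floats at once. A possible workaround, sketched here under assumptions rather than taken from the original answer, is to process the rows in chunks. The function name min_haversine_chunked, the chunk_size default and the AVG_EARTH_RADIUS value of 6371 km are illustrative choices.

```python
import numpy as np
import pandas as pd

AVG_EARTH_RADIUS = 6371.0  # km; assumed value, adjust to match your haversine implementation


def min_haversine_chunked(lat_deg, lon_deg, coords_list, chunk_size=10_000):
    """Minimum haversine distance (km) from each point to any coordinate in coords_list."""
    coords = np.deg2rad(np.asarray(coords_list, dtype=float))  # shape (m, 2): lat, lon
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    out = np.empty(lat.shape[0])
    for start in range(0, lat.shape[0], chunk_size):
        stop = start + chunk_size
        la = lat[start:stop, None]  # shape (k, 1)
        lo = lon[start:stop, None]
        # Broadcasting (k, 1) against (m,) gives a (k, m) block, never the full matrix.
        d = (np.sin((coords[:, 0] - la) * 0.5) ** 2
             + np.cos(la) * np.cos(coords[:, 0]) * np.sin((coords[:, 1] - lo) * 0.5) ** 2)
        out[start:stop] = (2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))).min(axis=1)
    return out


df = pd.DataFrame({
    "Latitude": [39.989, 39.923, 39.990, 39.884, 39.030],
    "Longitude": [-89.980, -89.901, -89.987, -89.943, -89.931],
})
end_coords_list = [(41.342, -90.423), (40.349, -91.394), (38.928, -89.323)]
# chunk_size=2 is only there to exercise the chunking on this tiny frame.
df["Min_Distance"] = min_haversine_chunked(
    df["Latitude"], df["Longitude"], end_coords_list, chunk_size=2)
print(df)
```

Peak memory is then governed by chunk_size times the number of coordinates, so the chunk size can be tuned to the available RAM.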

How to call data from a dataframe into Haversine function [duplicate]

This question already has an answer here:
Vectorised Haversine formula with a pandas dataframe
(1 answer)
Closed 6 years ago.
I have a dataframe called lat_long which contains the latitude and longitude of some locations. I want to find the distance between each pair of consecutive locations. When I use the example haversine function, I get an error: KeyError: ('1', u'occurred at index 0').
1 2
0 -6.081689 145.391881
1 -5.207083 145.788700
2 -5.826789 144.295861
3 -6.569828 146.726242
4 -9.443383 147.220050
def haversine(row):
    lon1 = lat_long['1']
    lat1 = lat_long['2']
    lon2 = row['1']
    lat2 = row['2']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a))
    km = 6367 * c
    return km
lat_long['distance'] = lat_long.apply(lambda row: haversine(row), axis=1)
lat_long
Try this solution:
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
Demo:
In [17]: df
Out[17]:
lat lon
0 -6.081689 145.391881
1 -5.207083 145.788700
2 -5.826789 144.295861
3 -6.569828 146.726242
4 -9.443383 147.220050
In [18]: df['dist'] = \
    ...: haversine_np(df.lon.shift(), df.lat.shift(), df.loc[1:, 'lon'], df.loc[1:, 'lat'])
In [19]: df
Out[19]:
lat lon dist
0 -6.081689 145.391881 NaN
1 -5.207083 145.788700 106.638117
2 -5.826789 144.295861 178.907364
3 -6.569828 146.726242 280.904983
4 -9.443383 147.220050 323.913612

Vectorised Haversine formula with a pandas dataframe

I know that to find the distance between two latitude, longitude points I need to use the haversine function:
def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
I have a DataFrame where one column is latitude and another column is longitude. I want to find out how far these points are from a set point, -56.7213600, 37.2175900. How do I take the values from the DataFrame and put them into the function?
example DataFrame:
SEAZ LAT LON
1 296.40, 58.7312210, 28.3774110
2 274.72, 56.8148320, 31.2923240
3 192.25, 52.0649880, 35.8018640
4 34.34, 68.8188750, 67.1933670
5 271.05, 56.6699880, 31.6880620
6 131.88, 48.5546220, 49.7827730
7 350.71, 64.7742720, 31.3953780
8 214.44, 53.5192920, 33.8458560
9 1.46, 67.9433740, 38.4842520
10 273.55, 53.3437310, 4.4716664
I can't confirm if the calculations are correct but the following worked:
In [11]:
from numpy import cos, sin, arcsin, sqrt
from math import radians
def haversine(row):
    lon1 = -56.7213600
    lat1 = 37.2175900
    lon2 = row['LON']
    lat2 = row['LAT']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a))
    km = 6367 * c
    return km
df['distance'] = df.apply(lambda row: haversine(row), axis=1)
df
Out[11]:
SEAZ LAT LON distance
index
1 296.40 58.731221 28.377411 6275.791920
2 274.72 56.814832 31.292324 6509.727368
3 192.25 52.064988 35.801864 6990.144378
4 34.34 68.818875 67.193367 7357.221846
5 271.05 56.669988 31.688062 6538.047542
6 131.88 48.554622 49.782773 8036.968198
7 350.71 64.774272 31.395378 6229.733699
8 214.44 53.519292 33.845856 6801.670843
9 1.46 67.943374 38.484252 6418.754323
10 273.55 53.343731 4.471666 4935.394528
The following code is actually slower on such a small dataframe but I applied it to a 100,000 row df:
In [35]:
%%timeit
df['LAT_rad'], df['LON_rad'] = np.radians(df['LAT']), np.radians(df['LON'])
df['dLON'] = df['LON_rad'] - math.radians(-56.7213600)
df['dLAT'] = df['LAT_rad'] - math.radians(37.2175900)
df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin(df['dLAT']/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(df['LAT_rad']) * np.sin(df['dLON']/2)**2))
1 loops, best of 3: 17.2 ms per loop
By comparison, the apply version took 4.3 s on the same frame, so the vectorised code is nearly 250 times quicker, which is worth remembering for larger data.
If we compress all the above in to a one-liner:
In [39]:
%timeit df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin((np.radians(df['LAT']) - math.radians(37.2175900))/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(np.radians(df['LAT'])) * np.sin((np.radians(df['LON']) - math.radians(-56.7213600))/2)**2))
100 loops, best of 3: 12.6 ms per loop
This gives a further speed-up: the one-liner is roughly 341 times quicker than the apply version.
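The one-liner can also be wrapped in a reusable function. This is a sketch under assumptions, not the original author's code: the name haversine_to_point and its radius_km parameter are hypothetical, and only two rows of the example frame are reproduced.

```python
import numpy as np
import pandas as pd


def haversine_to_point(lat_deg, lon_deg, lat0_deg, lon0_deg, radius_km=6367.0):
    """Distance in km from each (lat, lon) to the fixed point (lat0, lon0)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    lat0, lon0 = np.radians(lat0_deg), np.radians(lon0_deg)
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# Rows 1 and 10 of the example DataFrame.
df = pd.DataFrame({"LAT": [58.7312210, 53.3437310],
                   "LON": [28.3774110, 4.4716664]}, index=[1, 10])
# Same reference point as above: lon -56.7213600, lat 37.2175900.
df["distance"] = haversine_to_point(df["LAT"], df["LON"], 37.2175900, -56.7213600)
print(df)
```

Keeping the reference point as arguments avoids repeating the literals inside the formula, and the whole column is still computed in a single vectorised pass.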
