How to Convert Asymmetrical Dictionary to a pandas DataFrame - python

dict_circles = {
    'radii': [1, 5, 10, 15],
    'feature_group': [1, 5, 10, 15],
    'lat_long': {'city1': [lat1, long1],
                 'city2': [lat2, long2]
                 }
}
I'm working with the dictionary above and would like to create the following pandas DataFrame by splitting off the lat & long values and then duplicating them to create symmetry in a new DF:
       radii  feature_group   lat   long
city1      1              1  lat1  long1
city2      1              1  lat2  long2
city1      5              5  lat1  long1
city2      5              5  lat2  long2
city1     10             10  lat1  long1
city2     10             10  lat2  long2
city1     15             15  lat1  long1
city2     15             15  lat2  long2
From what I can tell I will need a recursion [isinstance(data, type) or other..] function to access the list nested within the inner dictionary 'lat_long', and probably also to use 'pd.DataFrame.from_dict()', and maybe a dictionary comprehension. The solution escapes me. If there's a better strategy please advise.

Here is one way to work with dict_circles:
import pandas as pd
# I multiplied each feature group by 10, to distinguish vs radii
dict_circles = {
    'radii': [1, 5, 10, 15],
    'feature_group': [10, 50, 100, 150],
    'lat_long': {'city1': ['lat1', 'long1'],
                 'city2': ['lat2', 'long2']
                 }
}
# convert dict_circles (which is a nested dict) to list-of-tuples
tuples = [(city, r, fg, lat, lon)
          for r, fg in zip(dict_circles['radii'], dict_circles['feature_group'])
          for city, (lat, lon) in dict_circles['lat_long'].items()
          ]
# the list-of-tuples is compatible with the DataFrame constructor
df = pd.DataFrame(tuples,
                  columns=('city', 'radii', 'feature_group', 'lat', 'long'))
print(df)
city radii feature_group lat long
0 city1 1 10 lat1 long1
1 city2 1 10 lat2 long2
2 city1 5 50 lat1 long1
3 city2 5 50 lat2 long2
4 city1 10 100 lat1 long1
5 city2 10 100 lat2 long2
6 city1 15 150 lat1 long1
7 city2 15 150 lat2 long2
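
If a cross join reads more naturally than a nested comprehension, the same table can probably be built with `DataFrame.merge(how='cross')`. This is only a sketch and assumes pandas 1.2 or newer; the names `circles`, `cities` and `cross` are illustrative and not part of the answer above.

```python
import pandas as pd

dict_circles = {
    'radii': [1, 5, 10, 15],
    'feature_group': [10, 50, 100, 150],
    'lat_long': {'city1': ['lat1', 'long1'],
                 'city2': ['lat2', 'long2']},
}

# One row per (radii, feature_group) pair
circles = pd.DataFrame({'radii': dict_circles['radii'],
                        'feature_group': dict_circles['feature_group']})

# One row per city, with the two-element list split into lat / long columns
cities = (pd.DataFrame.from_dict(dict_circles['lat_long'], orient='index',
                                 columns=['lat', 'long'])
          .rename_axis('city')
          .reset_index())

# Cartesian product: every circle row is paired with every city row
cross = circles.merge(cities, how='cross')
cross = cross[['city', 'radii', 'feature_group', 'lat', 'long']]
print(cross)
```

The cross join keeps the two inputs as ordinary DataFrames, which may be convenient if either side later gains extra columns.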

Related

Distance Matrix Haversine

I am working on a data frame that looks like this :
lat lon
id_zone
0 40.0795 4.338600
1 45.9990 4.829600
2 45.2729 2.882000
3 45.7336 4.850478
4 45.6981 5.043200
I'm trying to build a Haversine distance matrix. Basically, for each zone, I would like to calculate the distance between it and all the others in the dataframe, so there should be only 0s on the diagonal. Here is the Haversine function that I use, but I can't manage to build my matrix.
from math import radians, sin, cos, atan2, sqrt
def haversine(x):
    x.lon, x.lat, x.lon2, x.lat2 = map(radians, [x.lon, x.lat, x.lon2, x.lat2])
    # Haversine formula
    dlon = x.lon2 - x.lon
    dlat = x.lat2 - x.lat
    a = sin(dlat / 2) ** 2 + cos(x.lat) * cos(x.lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    km = 6367 * c
    return km
You can adapt the solution from this answer: "Pandas - Creating Difference Matrix from Data Frame".
Or in your specific case, where you have a DataFrame like this example:
lat lon
id_zone
0 40.0795 4.338600
1 45.9990 4.829600
2 45.2729 2.882000
3 45.7336 4.850478
4 45.6981 5.043200
And your function is defined as:
import numpy as np
def haversine(first, second):
    # convert decimal degrees to radians (rows are accessed by column label)
    lat, lon, lat2, lon2 = map(np.radians, [first['lat'], first['lon'], second['lat'], second['lon']])
    # haversine formula
    dlon = lon2 - lon
    dlat = lat2 - lat
    a = np.sin(dlat/2)**2 + np.cos(lat) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
Where you pass the lat and lon of the first location and the second location.
You can then create a distance matrix using Numpy and then replace the zeros with the distance results from the haversine function:
# create a matrix for the distances between each pair of zones
distances = np.zeros((len(df), len(df)))
for i in range(len(df)):
    for j in range(len(df)):
        distances[i, j] = haversine(df.iloc[i], df.iloc[j])
pd.DataFrame(distances, index=df.index, columns=df.index)
Your output should be similar to this:
id_zone 0 1 2 3 4
id_zone
0 0.000000 659.422944 589.599339 630.083979 627.383858
1 659.422944 0.000000 171.597296 29.555376 37.325316
2 589.599339 171.597296 0.000000 161.731366 174.983855
3 630.083979 29.555376 161.731366 0.000000 15.474533
4 627.383858 37.325316 174.983855 15.474533 0.000000
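
For larger frames the double loop can become slow. One possible alternative is to let NumPy broadcasting compute every pair at once. The sketch below assumes the `lat`/`lon` columns from the example; `haversine_matrix` and `radius_km` are hypothetical names introduced here, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'lat': [40.0795, 45.9990, 45.2729, 45.7336, 45.6981],
     'lon': [4.338600, 4.829600, 2.882000, 4.850478, 5.043200]},
    index=pd.Index(range(5), name='id_zone'))

def haversine_matrix(lat_deg, lon_deg, radius_km=6371.0):
    """Pairwise great-circle distances (km) between all points."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Column vector minus row vector gives an n x n matrix of differences
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

dist = pd.DataFrame(haversine_matrix(df['lat'], df['lon']),
                    index=df.index, columns=df.index)
print(dist.round(3))
```

The result is symmetric with zeros on the diagonal, as in the loop version; memory grows with the square of the number of zones, so this suits moderately sized frames.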

Calculate distance between consecutive GPS points and reduce GPS density based on this distance

I have a pandas dataframe that represents the GPS trajectory of a vehicle
d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467, 4.924583, 4.922133, 4.921400, 4.920967], 'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282, 52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)
id longitude latitude
1 4.929783 52.372250
2 4.932333 52.370884
3 4.933950 52.371101
4 4.933900 52.372234
5 4.928467 52.375282
6 4.924583 52.375950
7 4.922133 52.376301
8 4.921400 52.376232
9 4.920967 52.374481
I already calculated the (haversine) distance in meters between consecutive GPS points as follows:
import numpy as np
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m
df1['distance'] = haversine(df1['latitude'], df1['longitude'],
                            df1['latitude'].shift(), df1['longitude'].shift())
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
3 4.933950 52.371101 112.398101
4 4.933900 52.372234 126.029572
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
8 4.921400 52.376232 50.345227
9 4.920967 52.374481 196.908503
Now I would like to create a function that:
1. removes the second (i.e. the following) point if the distance between consecutive GPS points is less than 150 meters;
2. always keeps the last (and the first) GPS point, regardless of its distance from the previously kept point.
Meaning this should be the output:
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
9 4.920967 52.374481 196.908503
What is the best way to achieve this in python?
NOTE: This doesn't account for maximum distance... that would require some look ahead and optimization.
I would iterate through and pass back just the index values of the rows you'd like to keep. Use those index values in a loc call.
Distance
Use whatever metric you want. I used OP's haversine distance.
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m
def dis(t0, t1):
    lat0 = t0.latitude
    lon0 = t0.longitude
    lat1 = t1.latitude
    lon1 = t1.longitude
    return haversine(lat0, lon0, lat1, lon1)
The Loop
def f(d, threshold=50):
    itups = d.itertuples()
    last = next(itups)
    indices = [last.Index]
    distances = [0]
    for tup in itups:
        distance = dis(tup, last)
        if distance > threshold:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup
    return indices, distances
The Results
idx, distances = f(df1, 150)
df1.loc[idx].assign(distance=distances)
id longitude latitude distance
0 1 4.929783 52.372250 0.000000
1 2 4.932333 52.370884 230.305288
3 4 4.933900 52.372234 183.986479
4 5 4.928467 52.375282 500.896578
5 6 4.924583 52.375950 273.918990
6 7 4.922133 52.376301 170.828592
8 9 4.920967 52.374481 217.302775
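
The loop above keeps the last point only when it happens to lie beyond the threshold. If the final point must always survive, as the question asks, one option is to append it after the loop. The following is a sketch under that assumption; `thin_track` and `haversine_m` are hypothetical names, and the function assumes a non-empty frame with `latitude` and `longitude` columns.

```python
import numpy as np
import pandas as pd

d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
      'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467,
                    4.924583, 4.922133, 4.921400, 4.920967],
      'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282,
                   52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_km=6371):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius_km * 2 * np.arcsin(np.sqrt(a)) * 1000

def thin_track(d, threshold=150):
    """Keep the first point, every point farther than `threshold` meters
    from the previously kept point, and always the last point."""
    rows = d.itertuples()
    last = next(rows)              # the first point is always kept
    keep = [last.Index]
    for row in rows:
        dist = haversine_m(last.latitude, last.longitude,
                           row.latitude, row.longitude)
        if dist > threshold:
            keep.append(row.Index)
            last = row
    if keep[-1] != d.index[-1]:    # force the final point into the result
        keep.append(d.index[-1])
    return d.loc[keep]

kept = thin_track(df1, 150)
sparse = thin_track(df1, 10_000)   # huge threshold: only first and last remain
print(kept)
```

Note that a forced last point may sit closer than the threshold to its predecessor; whether that is acceptable depends on the use case.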

Longitude and Latitude Distance Calculation between 2 dataframes

I have the following two dataframes. Call this df1
City Latitude Longitude
0 NewYorkCity 40.7128 74.0060
1 Chicago 41.8781 87.6298
2 LA 34.0522 118.2437
3 Paris 48.8566 2.3522
and call this one df2
Place Latitude Longitude
0 75631 26.78436 -80.103
1 89210 26.75347 -80.0192
I want to know how I can calculate the distance between place and all cities listed. So it should look something like this.
Place Latitude Longitude NewYorkCity Chicago Paris
0 75631 26.78436 -80.103 some number ..... ....
1 89210 26.75347 -80.0192 some number .... ....
I'm reading through this particular post and attempting to adapt it: "Pandas Latitude-Longitude to distance between successive rows"
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
df['dist'] = haversine(df1.Latitude, df.Longitude, df2.Latitude, df2.Longitude)
I know this looks wrong. Am I needing a for loop to go through each of the ones in df1?
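from scipy.spatial import distance
# df is the cities frame (df1 in the question); the result is a Euclidean distance in degrees, not kilometres
a = df.iloc[:, 1:].values   # array of the cities' lat/long
b = df2.iloc[:, 1:].values  # array of the places' lat/long
df.join(pd.DataFrame(distance.cdist(a, b, 'euclidean')).rename(columns={0: 75631, 1: 89210}))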
City Latitude Longitude 75631 89210
0 NewYorkCity 40.7128 74.0060 154.737149 154.656475
1 Chicago 41.8781 87.6298 168.410550 168.329860
2 LA 34.0522 118.2437 198.479810 198.397200
3 Paris 48.8566 2.3522 85.358326 85.285379
Alternatively, the long way:
df2.rename(columns={'Latitude': 'Lat', 'Longitude': 'Long'}, inplace=True)  # rename lat/long in df2
g = pd.concat([df, df2.iloc[:1]], axis=1).ffill()  # attach the 1st place to every city
h = pd.concat([df, df2.iloc[1:]], axis=1).ffill().bfill()  # attach the 2nd place to every city
l = pd.concat([g, h])  # stacked frame (DataFrame.append was removed in pandas 2.0)
# compute the distance
u = l.Latitude.sub(l.Lat)
v = l.Longitude.sub(l.Long)
l['dist'] = np.sqrt(u**2 + v**2)
print(l)
City Latitude Longitude Place Lat Long dist
0 NewYorkCity 40.7128 74.0060 75631.0 26.78436 -80.1030 154.737149
1 Chicago 41.8781 87.6298 75631.0 26.78436 -80.1030 168.410550
2 LA 34.0522 118.2437 75631.0 26.78436 -80.1030 198.479810
3 Paris 48.8566 2.3522 75631.0 26.78436 -80.1030 85.358326
0 NewYorkCity 40.7128 74.0060 89210.0 26.75347 -80.0192 154.656475
1 Chicago 41.8781 87.6298 89210.0 26.75347 -80.0192 168.329860
2 LA 34.0522 118.2437 89210.0 26.75347 -80.0192 198.397200
3 Paris 48.8566 2.3522 89210.0 26.75347 -80.0192 85.285379
The following code worked for me:
import math
import numpy as np
a = list(range(19))
for i in a:
    Lat1 = df1.iloc[i, 2]  # works down 3rd column
    Lon1 = df1.iloc[i, 3]  # works down 4th column
    Lat2 = df2['Latitude']
    Lon2 = df2['Longitude']
    # the i in the below piece works down the 1st column to grab names
    # the code then places them into column names
    df2[df1.iloc[i, 0]] = 3958.756 * np.arccos(
        np.cos(math.radians(90 - Lat1)) * np.cos(np.radians(90 - Lat2))
        + np.sin(math.radians(90 - Lat1)) * np.sin(np.radians(90 - Lat2))
        * np.cos(np.radians(Lon1 - Lon2)))
Note that this gives the straight-line (great-circle) distance in miles between each pair of locations; it does not account for the twists and turns of an actual route.
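
For the wide layout the question asks for (one column per city), a broadcasting-based haversine might be a reasonable fit. This is a sketch only: `haversine_cross` is a hypothetical helper, the coordinates are copied as given in the question (including the positive longitudes), and distances are in kilometres with an assumed Earth radius of 6371 km.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'City': ['NewYorkCity', 'Chicago', 'LA', 'Paris'],
                    'Latitude': [40.7128, 41.8781, 34.0522, 48.8566],
                    'Longitude': [74.0060, 87.6298, 118.2437, 2.3522]})
df2 = pd.DataFrame({'Place': [75631, 89210],
                    'Latitude': [26.78436, 26.75347],
                    'Longitude': [-80.103, -80.0192]})

def haversine_cross(lat_a, lon_a, lat_b, lon_b, earth_radius=6371.0):
    """Distances (km) from every point in set A to every point in set B."""
    lat_a, lon_a, lat_b, lon_b = (
        np.radians(np.asarray(v, dtype=float))
        for v in (lat_a, lon_a, lat_b, lon_b))
    # Rows index set A, columns index set B
    dlat = lat_b[None, :] - lat_a[:, None]
    dlon = lon_b[None, :] - lon_a[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat_a)[:, None] * np.cos(lat_b)[None, :] * np.sin(dlon / 2) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

km = haversine_cross(df2['Latitude'], df2['Longitude'],
                     df1['Latitude'], df1['Longitude'])

# One row per place, one distance column per city
result = df2.join(pd.DataFrame(km, index=df2.index,
                               columns=df1['City'].tolist()))
print(result)
```

No explicit loop over cities is needed, and the number of city columns follows whatever `df1` contains.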

How to call data from a dataframe into Haversine function [duplicate]

This question was closed as a duplicate of "Vectorised Haversine formula with a pandas dataframe".
I have a dataframe called lat_long which contains the latitude and longitude of some locations. I want to find the distance between each location and the one that follows it. When I use the example haversine function, I get the error KeyError: ('1', u'occurred at index 0').
1 2
0 -6.081689 145.391881
1 -5.207083 145.788700
2 -5.826789 144.295861
3 -6.569828 146.726242
4 -9.443383 147.220050
def haversine(row):
    lon1 = lat_long['1']
    lat1 = lat_long['2']
    lon2 = row['1']
    lat2 = row['2']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a))
    km = 6367 * c
    return km
lat_long['distance'] = lat_long.apply(lambda row: haversine(row), axis=1)
lat_long
Try this solution:
import numpy as np
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
Demo:
In [17]: df
Out[17]:
lat lon
0 -6.081689 145.391881
1 -5.207083 145.788700
2 -5.826789 144.295861
3 -6.569828 146.726242
4 -9.443383 147.220050
In [18]: df['dist'] = \
...: haversine_np(df.lon.shift(), df.lat.shift(), df.loc[1:, 'lon'], df.loc[1:, 'lat'])
In [19]: df
Out[19]:
lat lon dist
0 -6.081689 145.391881 NaN
1 -5.207083 145.788700 106.638117
2 -5.826789 144.295861 178.907364
3 -6.569828 146.726242 280.904983
4 -9.443383 147.220050 323.913612
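
The slice on the second pair of arguments is not strictly needed: `shift()` already leaves NaN in the first row, and NaN propagates through the arithmetic. A self-contained sketch for current pandas, assuming the same `lat`/`lon` columns and the 6367 km radius used above (the `radius_km` parameter is an addition for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'lat': [-6.081689, -5.207083, -5.826789, -6.569828, -9.443383],
    'lon': [145.391881, 145.788700, 144.295861, 146.726242, 147.220050],
})

def haversine_np(lon1, lat1, lon2, lat2, radius_km=6367):
    """Element-wise great-circle distance (km) for inputs in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return radius_km * 2 * np.arcsin(np.sqrt(a))

# Previous row (shifted) versus current row; the first row has no predecessor
df['dist'] = haversine_np(df['lon'].shift(), df['lat'].shift(),
                          df['lon'], df['lat'])
print(df)
```

The first row stays NaN because it has no preceding point, which matches the demo output above.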

Vectorised Haversine formula with a pandas dataframe

I know that to find the distance between two latitude, longitude points I need to use the haversine function:
def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
I have a DataFrame where one column is latitude and another column is longitude. I want to find out how far these points are from a set point, -56.7213600, 37.2175900. How do I take the values from the DataFrame and put them into the function?
example DataFrame:
SEAZ LAT LON
1 296.40, 58.7312210, 28.3774110
2 274.72, 56.8148320, 31.2923240
3 192.25, 52.0649880, 35.8018640
4 34.34, 68.8188750, 67.1933670
5 271.05, 56.6699880, 31.6880620
6 131.88, 48.5546220, 49.7827730
7 350.71, 64.7742720, 31.3953780
8 214.44, 53.5192920, 33.8458560
9 1.46, 67.9433740, 38.4842520
10 273.55, 53.3437310, 4.4716664
I can't confirm if the calculations are correct but the following worked:
In [11]:
from numpy import cos, sin, arcsin, sqrt
from math import radians
def haversine(row):
    lon1 = -56.7213600
    lat1 = 37.2175900
    lon2 = row['LON']
    lat2 = row['LAT']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a))
    km = 6367 * c
    return km
df['distance'] = df.apply(lambda row: haversine(row), axis=1)
df
Out[11]:
SEAZ LAT LON distance
index
1 296.40 58.731221 28.377411 6275.791920
2 274.72 56.814832 31.292324 6509.727368
3 192.25 52.064988 35.801864 6990.144378
4 34.34 68.818875 67.193367 7357.221846
5 271.05 56.669988 31.688062 6538.047542
6 131.88 48.554622 49.782773 8036.968198
7 350.71 64.774272 31.395378 6229.733699
8 214.44 53.519292 33.845856 6801.670843
9 1.46 67.943374 38.484252 6418.754323
10 273.55 53.343731 4.471666 4935.394528
The following code is actually slower on such a small dataframe but I applied it to a 100,000 row df:
In [35]:
%%timeit
df['LAT_rad'], df['LON_rad'] = np.radians(df['LAT']), np.radians(df['LON'])
df['dLON'] = df['LON_rad'] - math.radians(-56.7213600)
df['dLAT'] = df['LAT_rad'] - math.radians(37.2175900)
df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin(df['dLAT']/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(df['LAT_rad']) * np.sin(df['dLON']/2)**2))
1 loops, best of 3: 17.2 ms per loop
Compared to the apply version, which took 4.3 s, this is nearly 250 times quicker, which is worth keeping in mind for the future.
If we compress all of the above into a one-liner:
In [39]:
%timeit df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin((np.radians(df['LAT']) - math.radians(37.2175900))/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(np.radians(df['LAT'])) * np.sin((np.radians(df['LON']) - math.radians(-56.7213600))/2)**2))
100 loops, best of 3: 12.6 ms per loop
We observe a further speed-up: the one-liner is now ~341 times quicker than the apply version.
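
The one-liner is fast but hard to read. One way to keep the vectorised speed while making the fixed point explicit is a small helper, sketched below. `haversine_to_point`, `LAT0` and `LON0` are hypothetical names; the reference point and the 6367 km radius are taken from the answer above, and only the first three rows of the example frame are used.

```python
import numpy as np
import pandas as pd

def haversine_to_point(lat, lon, lat0, lon0, radius_km=6367):
    """Great-circle distance (km) from each (lat, lon) to one fixed point.
    Works on scalars as well as on whole pandas columns."""
    lat, lon = np.radians(lat), np.radians(lon)
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({'LAT': [58.7312210, 56.8148320, 52.0649880],
                   'LON': [28.3774110, 31.2923240, 35.8018640]})
LAT0, LON0 = 37.2175900, -56.7213600

# Vectorised: one call over whole columns
df['distance'] = haversine_to_point(df['LAT'], df['LON'], LAT0, LON0)

# Row-wise apply, shown only to check that both routes agree
rowwise = df.apply(lambda r: haversine_to_point(r['LAT'], r['LON'], LAT0, LON0),
                   axis=1)
print(df)
```

Both routes should give the same numbers; only the vectorised call avoids the per-row Python overhead that made `apply` slow in the timings above.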
