I have the following two dataframes. Call this df1
City Latitude Longitude
0 NewYorkCity 40.7128 74.0060
1 Chicago 41.8781 87.6298
2 LA 34.0522 118.2437
3 Paris 48.8566 2.3522
and call this one df2
Place Latitude Longitude
0 75631 26.78436 -80.103
1 89210 26.75347 -80.0192
I want to know how I can calculate the distance between each place and all of the cities listed. So it should look something like this:
Place Latitude Longitude NewYorkCity Chicago Paris
0 75631 26.78436 -80.103 some number ..... ....
1 89210 26.75347 -80.0192 some number .... ....
I'm reading through this particular post and attempting to adapt it: Pandas Latitude-Longitude to distance between successive rows
import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
df['dist'] = haversine(df1.Latitude, df.Longitude, df2.Latitude, df2.Longitude)
I know this looks wrong. Do I need a for loop to go through each of the rows in df1?
from scipy.spatial import distance
import pandas as pd

a = df.iloc[:, 1:].values    # array of the Lat/Long columns of the city frame
b = df2.iloc[:, 1:].values   # array of the Lat/Long columns of the place frame
df.join(pd.DataFrame(distance.cdist(a, b, 'euclidean')).rename(columns={0: 75631, 1: 89210}))
City Latitude Longitude 75631 89210
0 NewYorkCity 40.7128 74.0060 154.737149 154.656475
1 Chicago 41.8781 87.6298 168.410550 168.329860
2 LA 34.0522 118.2437 198.479810 198.397200
3 Paris 48.8566 2.3522 85.358326 85.285379
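Note that the euclidean metric above works on raw degree values, so the numbers are not geographic distances. A minimal sketch of the same join pattern using great-circle distances instead, assuming scikit-learn is available (its haversine_distances expects [lat, lon] in radians and returns unit-sphere values, so the result is scaled by the Earth's radius):

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

a = np.radians(df.iloc[:, 1:].values)    # cities' [Lat, Long] in radians
b = np.radians(df2.iloc[:, 1:].values)   # places' [Lat, Long] in radians
km = haversine_distances(a, b) * 6371    # shape (n_cities, n_places), in km

df.join(pd.DataFrame(km).rename(columns={0: 75631, 1: 89210}))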
Alternatively, here is a longer way:
df2.rename(columns={'Latitude': 'Lat', 'Longitude': 'Long'}, inplace=True)  # rename Lat/Long in df2
g = pd.concat([df, df2.iloc[:1]], axis=1).ffill()          # append 1st place onto df
h = pd.concat([df, df2.iloc[1:]], axis=1).ffill().bfill()  # append 2nd place onto df
l = pd.concat([g, h])                                      # new dataframe
# compute distance
u = l.Latitude.sub(l.Lat)
v = l.Longitude.sub(l.Long)
l['dist'] = np.sqrt(u**2 + v**2)
print(l)
City Latitude Longitude Place Lat Long dist
0 NewYorkCity 40.7128 74.0060 75631.0 26.78436 -80.1030 154.737149
1 Chicago 41.8781 87.6298 75631.0 26.78436 -80.1030 168.410550
2 LA 34.0522 118.2437 75631.0 26.78436 -80.1030 198.479810
3 Paris 48.8566 2.3522 75631.0 26.78436 -80.1030 85.358326
0 NewYorkCity 40.7128 74.0060 89210.0 26.75347 -80.0192 154.656475
1 Chicago 41.8781 87.6298 89210.0 26.75347 -80.0192 168.329860
2 LA 34.0522 118.2437 89210.0 26.75347 -80.0192 198.397200
3 Paris 48.8566 2.3522 89210.0 26.75347 -80.0192 85.285379
The following code worked for me:
import math
import numpy as np

for i in range(19):            # one pass per row of df1
    Lat1 = df1.iloc[i, 2]      # works down the 3rd column
    Lon1 = df1.iloc[i, 3]      # works down the 4th column
    Lat2 = df2['Latitude']
    Lon2 = df2['Longitude']
    # the i below works down the 1st column to grab the city names,
    # which are then used as the new column names
    df2[df1.iloc[i, 0]] = 3958.756 * np.arccos(
        np.cos(math.radians(90 - Lat1)) * np.cos(np.radians(90 - Lat2))
        + np.sin(math.radians(90 - Lat1)) * np.sin(np.radians(90 - Lat2)) * np.cos(np.radians(Lon1 - Lon2)))
Note that this calculates the distance in miles between each pair of locations as the crow flies; it doesn't factor in the twists and turns of an actual route.
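A vectorized sketch of the same idea, using NumPy broadcasting instead of an explicit loop, that produces the wide layout asked for in the question (one column per city). Frame and column names follow the sample data above, which is an assumption:

import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    # great-circle distance in km
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# broadcast every place in df2 against every city in df1 -> shape (len(df2), len(df1))
dists = haversine(df2['Latitude'].values[:, None], df2['Longitude'].values[:, None],
                  df1['Latitude'].values[None, :], df1['Longitude'].values[None, :])

wide = df2.join(pd.DataFrame(dists, columns=df1['City'].values, index=df2.index))
print(wide)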
Related
I am working on a data frame that looks like this:
lat lon
id_zone
0 40.0795 4.338600
1 45.9990 4.829600
2 45.2729 2.882000
3 45.7336 4.850478
4 45.6981 5.043200
I'm trying to make a Haversine distance matrix. Basically, for each zone, I would like to calculate the distance between it and all the others in the dataframe. So there should be only 0s on the diagonal. Here is the Haversine function that I use, but I can't make my matrix.
from math import radians, sin, cos, atan2, sqrt

def haversine(x):
    x.lon, x.lat, x.lon2, x.lat2 = map(radians, [x.lon, x.lat, x.lon2, x.lat2])
    # Haversine formula
    dlon = x.lon2 - x.lon
    dlat = x.lat2 - x.lat
    a = sin(dlat / 2) ** 2 + cos(x.lat) * cos(x.lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    km = 6367 * c
    return km
You can use the solution from this answer: Pandas - Creating Difference Matrix from Data Frame
Or in your specific case, where you have a DataFrame like this example:
lat lon
id_zone
0 40.0795 4.338600
1 45.9990 4.829600
2 45.2729 2.882000
3 45.7336 4.850478
4 45.6981 5.043200
And your function is defined as:
import numpy as np

def haversine(first, second):
    # convert decimal degrees to radians
    lat, lon, lat2, lon2 = map(np.radians, [first[0], first[1], second[0], second[1]])
    # haversine formula
    dlon = lon2 - lon
    dlat = lat2 - lat
    a = np.sin(dlat/2)**2 + np.cos(lat) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # radius of Earth in kilometers; use 3956 for miles
    return c * r
Where you pass the lat and lon of the first location and the second location.
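For example, a quick usage sketch on the frame above (the value matches the matrix shown below):

print(haversine(df.iloc[0], df.iloc[1]))   # about 659.42 km between zones 0 and 1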
You can then create a distance matrix using NumPy and then replace the zeros with the distance results from the haversine function:
# create a matrix for the distances between each pair of zones
distances = np.zeros((len(df), len(df)))
for i in range(len(df)):
    for j in range(len(df)):
        distances[i, j] = haversine(df.iloc[i], df.iloc[j])
pd.DataFrame(distances, index=df.index, columns=df.index)
Your output should be similar to this:
id_zone 0 1 2 3 4
id_zone
0 0.000000 659.422944 589.599339 630.083979 627.383858
1 659.422944 0.000000 171.597296 29.555376 37.325316
2 589.599339 171.597296 0.000000 161.731366 174.983855
3 630.083979 29.555376 161.731366 0.000000 15.474533
4 627.383858 37.325316 174.983855 15.474533 0.000000
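For larger frames the double loop gets slow; a broadcasted sketch of the same formula (same column names assumed) builds the whole matrix in one shot:

import numpy as np
import pandas as pd

lat = np.radians(df['lat'].to_numpy())
lon = np.radians(df['lon'].to_numpy())

# pairwise differences via broadcasting -> (n, n) arrays
dlat = lat[:, None] - lat[None, :]
dlon = lon[:, None] - lon[None, :]
a = np.sin(dlat / 2)**2 + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2)**2
km = 6371 * 2 * np.arcsin(np.sqrt(a))

pd.DataFrame(km, index=df.index, columns=df.index)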
I am trying something that may be a little hard to understand, but I will try to be very specific.
I have a pandas DataFrame like this:
Locality                           Count        Lat.        Long.
Krasnodar                          Russia       44          39
Tirana                             Albania      41.33       19.83
Areni                              Armenia      39.73       45.2
Kars                               Armenia      40.604517   43.100758
Brunn Wolfholz                     Austria      48.120396   16.291722
Kleinhadersdorf Flur Marchleiten   Austria      48.663197   16.589687
Jalilabad district                 Azerbaijan   39.3607139  48.4613556
Zeyem Chaj                         Azerbaijan   40.9418889  45.8327778
Jalilabad district                 Azerbaijan   39.5186111  48.65
And a file cities.txt with the names of some countries:
Albania
Armenia
Austria
Azerbaijan
And so on.
The next thing I am doing is converting these Lat. and Long. values to radians and then, with the values from the list, doing something like:
with open('cities.txt') as file:
    lines = file.readlines()
x = np.where(df['Count'].eq(lines), pd.DataFrame(
    dist.pairwise(df[['Lat.', 'Long.']].to_numpy()) * 6373,
    columns=df.Locality.unique(), index=df.Locality.unique()))
Where pd.DataFrame(dist.pairwise(df[['Lat.','Long.']].to_numpy())*6373, columns=df.Locality.unique(), index=df.Locality.unique()) converts the radian Lat./Long. values into distances in km and creates a matrix-style DataFrame for each line (country).
In the end I will (in theory) have a lot of 2-D matrices grouped by country, and I want to apply this:
>>>Russia.min()
0
>>>Russia.max()
5
to get the .min() and .max() values of each matrix and save these results in cities.txt as:
Country Max.Dist. Min. Dist.
Albania 5 1
Armenia 10 9
Austria 5 3
Azerbaijan 0 0
Unfortunately, 1) I'm stuck in the first part, where I get the error ValueError: Lengths must be equal; 2) is it possible to have these matrices grouped by country; and 3) can I save my .min() and .max() values?
I am not sure what exactly you want as the minimum. In this solution, the minimum is 0 if there is only one city, and otherwise the shortest distance between two cities within the country. Also, the file cities.txt seems to be just a filter; I didn't apply it here, but that seems straightforward.
import numpy as np
import pandas as pd
Here is just some sample data:
cities = pd.read_json("https://raw.githubusercontent.com/lutangar/cities.json/master/cities.json")
cities = cities.sample(10000)
Create and apply a custom aggregate for groupby()
from sklearn.metrics import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
country_groups = cities.groupby("country")
def city_distances(group):
    geo = group[['lat', 'lng']]
    EARTH_RADIUS = 6371
    haversine_distances = dist.pairwise(np.radians(geo))
    haversine_distances *= EARTH_RADIUS
    distances = {}
    distances['max'] = np.max(haversine_distances)
    distances['min'] = 0
    if len(haversine_distances[np.nonzero(haversine_distances)]) > 0:
        distances['min'] = np.min(haversine_distances[np.nonzero(haversine_distances)])
    return pd.Series(distances)
country_groups.apply(city_distances)
In my case this prints something like
max min
country
AE 323.288482 323.288482
AF 1130.966661 15.435642
AI 12.056890 12.056890
AL 272.300688 3.437074
AM 268.051071 1.328605
... ... ...
YE 662.412344 19.103222
YT 3.723376 3.723376
ZA 1466.334609 24.319334
ZM 1227.429001 218.566369
ZW 503.562608 26.316902
[194 rows x 2 columns]
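The result of the apply is an ordinary DataFrame, so saving the per-country values (part 3 of the question) is straightforward. A sketch that writes a tab-separated file; the output filename country_distances.txt is just an example:

result = country_groups.apply(city_distances)
(result.rename(columns={'max': 'Max.Dist.', 'min': 'Min.Dist.'})
       .to_csv('country_distances.txt', sep='\t'))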
I have a pandas dataframe that represents the GPS trajectory of a vehicle
d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467, 4.924583, 4.922133, 4.921400, 4.920967], 'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282, 52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)
id longitude latitude
1 4.929783 52.372250
2 4.932333 52.370884
3 4.933950 52.371101
4 4.933900 52.372234
5 4.928467 52.375282
6 4.924583 52.375950
7 4.922133 52.376301
8 4.921400 52.376232
9 4.920967 52.374481
I already calculated the (haversine) distance in meters between consecutive GPS points as follows:
import numpy as np
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m

df1['distance'] = haversine(df1['latitude'], df1['longitude'],
                            df1['latitude'].shift(), df1['longitude'].shift())
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
3 4.933950 52.371101 112.398101
4 4.933900 52.372234 126.029572
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
8 4.921400 52.376232 50.345227
9 4.920967 52.374481 196.908503
Now I would like to create a function that:
removes the second point, i.e. the following point, if the distance between consecutive GPS points is less than 150 meters, and
always keeps the first and the last GPS point, regardless of the distance to the previously kept point.
Meaning this should be the output:
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
9 4.920967 52.374481 196.908503
What is the best way to achieve this in python?
NOTE: This doesn't account for maximum distance... that would require some look ahead and optimization.
I would iterate through and pass back just the index values of the rows you'd like to keep. Use those index values in a loc call.
Distance
Use whatever metric you want. I used OP's haversine distance.
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m

def dis(t0, t1):
    lat0 = t0.latitude
    lon0 = t0.longitude
    lat1 = t1.latitude
    lon1 = t1.longitude
    return haversine(lat0, lon0, lat1, lon1)
The Loop
def f(d, threshold=50):
    itups = d.itertuples()
    last = next(itups)
    indices = [last.Index]
    distances = [0]
    for tup in itups:
        distance = dis(tup, last)
        if distance > threshold:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup
    return indices, distances
The Results
idx, distances = f(df1, 150)
df1.loc[idx].assign(distance=distances)
id longitude latitude distance
0 1 4.929783 52.372250 0.000000
1 2 4.932333 52.370884 230.305288
3 4 4.933900 52.372234 183.986479
4 5 4.928467 52.375282 500.896578
5 6 4.924583 52.375950 273.918990
6 7 4.922133 52.376301 170.828592
8 9 4.920967 52.374481 217.302775
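The question also asks to always keep the last GPS point; a small variation on the loop above (a sketch reusing dis() and the 150 m threshold) appends the final row even when it falls inside the threshold:

def f_keep_last(d, threshold=150):
    itups = d.itertuples()
    last = next(itups)
    indices = [last.Index]
    distances = [0]
    for tup in itups:
        distance = dis(tup, last)
        if distance > threshold:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup
    # always keep the final point, even if it is within the threshold
    if d.index[-1] not in indices:
        indices.append(d.index[-1])
        distances.append(dis(d.iloc[-1], last))
    return indices, distances

idx, distances = f_keep_last(df1, 150)
df1.loc[idx].assign(distance=distances)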
I would like to know the distance from each row in dataframe one to the nearest place in dataframe two. (That is, what is the nearest place, by distance, for each coordinate pair in my dataframe one?)
All my code is below.
I have two DataFrames (in the original DataFrames I have thousands of rows):
The first DataFrame, called "place_locations":
CLUSTER CLIENT LATITUDE LENGHT
0 X1 19.45685402 -70.68645898
1 X1 19.39320504 -70.52567322
2 X1 18.614736 -68.71711383
3 X2 18.47977644 -69.93177289
4 X2 19.76546997 -70.51085451
5 X3 18.55835346 -68.38226906
6 X3 19.79037017 -70.68748243
7 X4 19.2232559 -70.52629188
8 X4 18.42865751 -68.9703434
9 X5 19.37935119 -70.51440314
10 X5 18.68743273 -68.45068029
11 X6 19.44126162 -70.73159162
12 X6 19.6678557 -70.36758867
13 X7 18.7816069 -70.2598325
14 X8 19.48708304 -70.74375908
15 X8 18.93720371 -70.40746487
16 X9 19.299298 -69.5559162
17 X10 18.60044506 -68.41991221
18 X10 19.30702896 -69.54500792
19 X11 19.3783253 -70.618205
The second DataFrame, called "Coordinates_coords":
PLACE LATITUDE LENGHT
supermarket 18.63609095 -68.39650565
school 19.44512055 -70.66851055
restarant 18.48377033 -69.93910793
spa 18.46608496 -69.92713481
supermarket 18.45646778 -69.9395694
restaurant 18.4845644 -69.9300583
school 18.47284417 -69.9345797
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c
    return km
def Top_nearest(distancia, distancias, todos=False, limit=1.0):
    results = []
    for d in distancias:
        results.append(haversine_np(distancia[0], distancia[1], d[0], d[1]))
    results = np.array(results)
    if not todos:
        print(results.argmin())
        indexes = np.where(results < limit)
    else:
        indexes = np.where(results >= 0)
    return list(indexes[0]), results[indexes]
nearest_coordinates = list()
for index, row in place_locations.iterrows():
    indexes, distances = Top_nearest(row[['LATITUDE', 'LENGHT']].values,
                                     Coordinates_coords[['LATITUDE', 'LENGHT']].reset_index(drop=True).values,
                                     todos=True)
    nearest_coordinates.append(distances[0])
nearest_coordinates[:5]
place_locations['Distance_locations'] = nearest_coordinates
place_locations
The results I'm getting are not correct; there is something in the calculation that I can't identify. The Distance_locations column I'm getting doesn't give me the distance to the nearest location.
I've previously posted on this: location, distance, nearest.
It's simplest to use a library to calculate distances; geopy has worked well for me.
import geopy.distance
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO("""CLUSTER CLIENT LATITUDE LENGHT
0 X1 19.45685402 -70.68645898
1 X1 19.39320504 -70.52567322
2 X1 18.614736 -68.71711383
3 X2 18.47977644 -69.93177289
4 X2 19.76546997 -70.51085451
5 X3 18.55835346 -68.38226906
6 X3 19.79037017 -70.68748243
7 X4 19.2232559 -70.52629188
8 X4 18.42865751 -68.9703434
9 X5 19.37935119 -70.51440314
10 X5 18.68743273 -68.45068029
11 X6 19.44126162 -70.73159162
12 X6 19.6678557 -70.36758867
13 X7 18.7816069 -70.2598325
14 X8 19.48708304 -70.74375908
15 X8 18.93720371 -70.40746487
16 X9 19.299298 -69.5559162
17 X10 18.60044506 -68.41991221
18 X10 19.30702896 -69.54500792
19 X11 19.3783253 -70.618205"""), sep="\s+")
df2 = pd.read_csv(io.StringIO(""" PLACE LATITUDE LENGHT
supermarket 18.63609095 -68.39650565
school 19.44512055 -70.66851055
restarant 18.48377033 -69.93910793
spa 18.46608496 -69.92713481
supermarket 18.45646778 -69.9395694
restaurant 18.4845644 -69.9300583
school 18.47284417 -69.9345797"""), sep="\s+")
# no need to calc distance in both miles and km; they're there for reference
df3 = (df1
       .assign(foo=1)
       .merge(df2.assign(foo=1), on="foo")
       .assign(distance_km=lambda dfa: dfa.apply(lambda r:
           geopy.distance.geodesic(
               (r["LATITUDE_x"], r["LENGHT_x"]),
               (r["LATITUDE_y"], r["LENGHT_y"])).km, axis=1))
       .assign(distance_miles=lambda dfa: dfa.apply(lambda r:
           geopy.distance.geodesic(
               (r["LATITUDE_x"], r["LENGHT_x"]),
               (r["LATITUDE_y"], r["LENGHT_y"])).miles, axis=1))
)
# now find nearest PLACE to a CLIENT and count
(df3.sort_values(["CLIENT", "distance_km"])
    .groupby(["CLIENT"]).agg({"PLACE": "first", "distance_km": "first"})
    .reset_index()
    .groupby("PLACE")["CLIENT"].count()
    .to_frame().reset_index().sort_values("CLIENT", ascending=False)
)
output
PLACE CLIENT
2 school 5
3 supermarket 4
0 restarant 1
1 restaurant 1
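If all that's needed is the distance to the nearest place for every row of df1 (the Distance_locations column from the question), the question's own haversine_np can be broadcast over both frames. A sketch, with column names as in the question:

import numpy as np

lat1 = df1['LATITUDE'].to_numpy()[:, None]
lon1 = df1['LENGHT'].to_numpy()[:, None]
lat2 = df2['LATITUDE'].to_numpy()[None, :]
lon2 = df2['LENGHT'].to_numpy()[None, :]

km = haversine_np(lon1, lat1, lon2, lat2)             # shape (len(df1), len(df2))
df1['Distance_locations'] = km.min(axis=1)            # distance to the nearest place, in km
df1['Nearest_place'] = df2['PLACE'].to_numpy()[km.argmin(axis=1)]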
dict_circles = {
    'radii': [1, 5, 10, 15],
    'feature_group': [1, 5, 10, 15],
    'lat_long': {'city1': [lat1, long1],
                 'city2': [lat2, long2]
                 }
}
I'm working with the dictionary above and would like to create the following pandas DataFrame by splitting off the lat & long values and then duplicating them to create symmetry in a new DF:
**radii** **feature_group** **lat** **long**
city1 1 1 lat1 long1
city2 1 1 lat2 long2
city1 5 5 lat1 long1
city2 5 5 lat2 long2
city1 10 10 lat1 long1
city2 10 10 lat2 long2
city1 15 15 lat1 long1
city2 15 15 lat2 long2
From what I can tell, I will need a recursive function [isinstance(data, type) or similar] to access the list nested within the inner dictionary 'lat_long', and probably also pd.DataFrame.from_dict(), and maybe a dictionary comprehension. The solution escapes me. If there's a better strategy, please advise.
Here is one way to work with dict_circles:
import pandas as pd
# I multiplied each feature group by 10, to distinguish vs radii
dict_circles = {
    'radii': [1, 5, 10, 15],
    'feature_group': [10, 50, 100, 150],
    'lat_long': {'city1': ['lat1', 'long1'],
                 'city2': ['lat2', 'long2']
                 }
}
# convert dict_circles (which is a nested dict) to list-of-tuples
tuples = [(city, r, fg, lat, lon)
for r, fg in zip(dict_circles['radii'], dict_circles['feature_group'])
for city, (lat,lon) in dict_circles['lat_long'].items()
]
# the list-of-tuples is compatible with the DataFrame constructor
df = pd.DataFrame(tuples,
columns=('city', 'radii', 'feature_group', 'lat', 'long'))
print(df)
city radii feature_group lat long
0 city1 1 10 lat1 long1
1 city2 1 10 lat2 long2
2 city1 5 50 lat1 long1
3 city2 5 50 lat2 long2
4 city1 10 100 lat1 long1
5 city2 10 100 lat2 long2
6 city1 15 150 lat1 long1
7 city2 15 150 lat2 long2
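An equivalent construction, if the nested comprehension feels dense, uses itertools.product for the same pairing; this is purely a stylistic alternative and produces the same rows in the same order:

from itertools import product
import pandas as pd

rows = [(city, r, fg, lat, lon)
        for (r, fg), (city, (lat, lon)) in product(
            zip(dict_circles['radii'], dict_circles['feature_group']),
            dict_circles['lat_long'].items())]

df = pd.DataFrame(rows, columns=('city', 'radii', 'feature_group', 'lat', 'long'))
print(df)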