I'm trying to find an efficient solution to a nearest-neighbour assignment problem.
I have a list of stores (about 25,000) with coordinates, and a list of hexagons belonging to the different cities I cover with deliveries, each described by its centroid coordinates and its hexagon polygon.
I need to determine which hexagon each address belongs to.
I can compute the brute-force solution, but it takes about 2 days of calculation, and I'm interested in a fast solution I can re-run frequently whenever the city coverage changes.
Data for the stores:
INDIRIZZO_COMPLETO latitude longitude COORDINATE_EXTRACTION_DETAIL
0 LUNGOMARE LUIGI RIZZO 1, 92010 LAMPEDUSA E LIN... 35.497965 12.607482 from original address
1 VIA TERRANOVA 71, 92010 LAMPEDUSA E LINOSA (AG... 35.506421 12.610504 from original address
2 VIALE PAPA PIO XII 107/109, 00036 PALESTRINA (... 35.551062 12.320357 from zipcode: 36 Italy
3 VIA ROMA 82, 96010 PORTOPALO DI CAPO PASSERO (... 36.682967 15.133651 from original address
4 CONTRADA PIANETTI SNC, 96018 PACHINO (SR), SIC... 36.700497 15.073600 from zipcode: 96018 Italy
Data for the hexagons:
city_code Polygon latitude longitude
0 SCN POLYGON ((10.63303663611384 44.59771368472511,... 44.597003 10.635361
1 SCN POLYGON ((10.706225086720105 44.58751732975397... 44.586805 10.708550
2 BAR POLYGON ((16.939176495419776 41.09659615583256... 41.095711 16.941403
3 BAR POLYGON ((16.925717571722554 41.10755391076213... 41.106669 16.927944
4 BAR POLYGON ((16.89992580762363 41.067339007464646... 41.066454 16.902151
I implemented a solution using this:
import numpy as np
from sklearn.neighbors import BallTree

# df holds the hexagon centroids (with city_code), df_other holds the stores
tree = BallTree(np.deg2rad(df[['latitude', 'longitude']].values), metric='haversine')
distances, indices = tree.query(np.deg2rad(np.c_[query_lats, query_lons]), k=5)
r_km = 6371  # multiplier to convert to km (from unit distance)
for name, d, ind in zip(df_other['INDIRIZZO_COMPLETO'], distances, indices):
    print(f"INDIRIZZO_COMPLETO {name} closest matches:")
    for i, index in enumerate(ind):
        print(f"\t{df['city_code'][index]} with distance {d[i]*r_km:.4f} km")
        list_data = (name, df['city_code'][index], d[i]*r_km)
        append_list_as_row(file_name_2, list_data)
The results are quite good for some zones but totally wrong for a lot of others.
Any suggestions?
Use geopandas.sjoin to efficiently assign points to polygons:
gdf = geopandas.GeoDataFrame(
    df,
    geometry=geopandas.points_from_xy(
        df.longitude, df.latitude
    ),
)
joined = geopandas.sjoin(
    gdf,
    other_df,
    how="left",
    predicate="intersects",
)
I have converted a 200m x 200m point grid of Greater London into a multipolygon layer of 500m-radius buffers, one for each point in the grid. This means I have over 100,000 overlapping polygons.
I also have a year's worth of crime data as a point layer with lat/longs (over 1.1 million crimes x 12 columns of data).
I am trying to find the most efficient way to count the number of crime points in each polygon buffer. Because the buffers overlap, the same crime point can fall inside, and should be counted for, several buffers.
The spatial join in geopandas doesn't seem to work, maybe because the polygons are overlapping? If I use an "inner" join I just get an empty dataframe back. If I use a "left" join I get all the crime rows (1.1 million) with the buffer polygon columns to the right all NaN, and vice versa with a "right" join: just the buffer rows (100,000) with the crime columns as NaN. See the code below:
import pandas as pd
import geopandas as gpd
from geopandas import read_file
from pandas import read_csv
from geopandas import GeoDataFrame, points_from_xy
#import buffer polygon layer
gBuffer = read_file('London Buffer.zip')
df1 = gBuffer.head()
#import crime csv
crime = read_csv('2020-2021 London Crime.csv')
#drop nan rows from coords
crime2 = crime[crime['Longitude'].notna()]
df2 = crime2.head()
#geocode crime points
gCrime = GeoDataFrame(crime2, geometry=points_from_xy(crime2['Longitude'], crime2['Latitude']))
df3 = gCrime.head()
#set equal crs
gCrime.crs = gBuffer.crs
#spatial join data
BufferCrime = gpd.sjoin(gCrime, gBuffer, how="inner")
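For context, what I ultimately want is a count per buffer, roughly like the sketch below once the join actually returns rows (index_right is the name sjoin gives the right-hand index by default):

# group the joined rows by the buffer index and write the counts back onto gBuffer,
# filling 0 for buffers that contain no crime points
counts = BufferCrime.groupby("index_right").size()
gBuffer["crime_count"] = counts.reindex(gBuffer.index, fill_value=0)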
The other solution is to iterate over each polygon and count the number of points but this will take forever given that it has to do 100,000 x 1,100,000 iterations
# Loop over polygons with index i.
pts_in_polys = []
for i, poly in gBuffer.iterrows():
    # list of points in this poly
    pts_in_this_poly = []
    # loop over all points
    for j, pt in gCrime.iterrows():
        if poly.geometry.contains(pt.geometry):
            # Add it to the list
            pts_in_this_poly.append(pt.geometry)
    pts_in_polys.append(len(pts_in_this_poly))
# Add the counts (a plain list of ints, so no GeoSeries needed)
gBuffer['number of Crime points'] = pts_in_polys
Any ideas what would be the best way to solve this problem?
I have a list of points (longitude and latitude), as well as their associated point geometries, in a geodataframe. All of the points should be able to be subdivided into individual polygons, as the points are generally clustered in several areas. What I would like is some sort of algorithm that loops over the points and checks the distance between the previous and current point. If the distance is sufficiently small, it would group those points together. This process would continue until the current point is too far away. It would then make a polygon out of those close points and continue the process with the next group of points (a rough sketch of what I have in mind is below the sample data).
gdf
longitude latitude geometry
0 -76.575249 21.157229 POINT (-76.57525 21.15723)
1 -76.575035 21.157453 POINT (-76.57503 21.15745)
2 -76.575255 21.157678 POINT (-76.57526 21.15768)
3 -76.575470 21.157454 POINT (-76.57547 21.15745)
5 -112.973177 31.317333 POINT (-112.97318 31.31733)
... ... ... ...
2222 -113.492501 47.645914 POINT (-113.49250 47.64591)
2223 -113.492996 47.643609 POINT (-113.49300 47.64361)
2225 -113.492379 47.643557 POINT (-113.49238 47.64356)
2227 -113.487443 47.643142 POINT (-113.48744 47.64314)
2230 -105.022627 48.585669 POINT (-105.02263 48.58567)
So in the data above, the first 4 points would be grouped together and turned into a polygon. Then, it would move onto the next group, and so forth. Each group of points is not evenly spaced, i.e., the next group might be 7 pairs of points, and the following could be 3. Ideally, the final output would be another geodataframe that is just a bunch of polygons.
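The rough sketch of that loop (the 1 km threshold is just a placeholder, and the haversine helper is written out for completeness):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    # great-circle distance between two lon/lat pairs, in kilometres
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

threshold_km = 1.0  # placeholder threshold
rows = list(gdf.itertuples())
groups, current = [], [rows[0]]
for prev, row in zip(rows, rows[1:]):
    # start a new group whenever the gap to the previous point exceeds the threshold
    if haversine_km(prev.longitude, prev.latitude, row.longitude, row.latitude) <= threshold_km:
        current.append(row)
    else:
        groups.append(current)
        current = [row]
groups.append(current)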
You can try DBSCAN clustering, since it automatically determines the number of clusters and lets you specify a maximum distance between points (eps, ε).
Using your example, the algorithm identifies two clusters.
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.DataFrame(
    [
        [-76.575249, 21.157229, (-76.57525, 21.15723)],
        [-76.575035, 21.157453, (-76.57503, 21.15745)],
        [-76.575255, 21.157678, (-76.57526, 21.15768)],
        [-76.575470, 21.157454, (-76.57547, 21.15745)],
        [-112.973177, 31.317333, (-112.97318, 31.31733)],
        [-113.492501, 47.645914, (-113.49250, 47.64591)],
        [-113.492996, 47.643609, (-113.49300, 47.64361)],
        [-113.492379, 47.643557, (-113.49238, 47.64356)],
        [-113.487443, 47.643142, (-105.02263, 48.58567) if False else (-113.48744, 47.64314)],
        [-105.022627, 48.585669, (-105.02263, 48.58567)],
    ],
    columns=["longitude", "latitude", "geometry"],
)
clustering = DBSCAN(eps=0.3, min_samples=4).fit(df[['longitude','latitude']].values)
gdf = pd.concat([df, pd.Series(clustering.labels_, name='label')], axis=1)
print(gdf)
gdf.plot.scatter(x='longitude', y='latitude', c='label')
longitude latitude geometry label
0 -76.575249 21.157229 (-76.57525, 21.15723) 0
1 -76.575035 21.157453 (-76.57503, 21.15745) 0
2 -76.575255 21.157678 (-76.57526, 21.15768) 0
3 -76.575470 21.157454 (-76.57547, 21.15745) 0
4 -112.973177 31.317333 (-112.97318, 31.31733) -1 # not in cluster
5 -113.492501 47.645914 (-113.4925, 47.64591) 1
6 -113.492996 47.643609 (-113.493, 47.64361) 1
7 -113.492379 47.643557 (-113.49238, 47.64356) 1
8 -113.487443 47.643142 (-113.48744, 47.64314) 1
9 -105.022627 48.585669 (-105.02263, 48.58567) -1 # not in cluster
If we add random data to your data set, run the clustering algorithm, and filter out those data points not in clusters, you get a clearer idea of how it's working.
import numpy as np
rng = np.random.default_rng(seed=42)
arr2 = pd.DataFrame(rng.random((3000, 2)) * 100, columns=['latitude', 'longitude'])
randdf = pd.concat([df[['latitude', 'longitude']], arr2]).reset_index()
clustering = DBSCAN(eps=1, min_samples=4).fit(randdf[['longitude','latitude']].values)
labels = pd.Series(clustering.labels_, name='label')
gdf = pd.concat([randdf[['latitude', 'longitude']], labels], axis=1)
subgdf = gdf[gdf['label']> -1]
subgdf.plot.scatter(x='longitude', y='latitude', c='label', colormap='viridis', figsize=(20,10))
print(gdf['label'].value_counts())
-1 2527
16 10
3 8
10 8
50 8
...
57 4
64 4
61 4
17 4
0 4
Name: label, Length: 99, dtype: int64
Getting the clustered points from this dataframe would be relatively simple. Something like this:
subgdf['point'] = subgdf.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
subgdf.groupby(['label'])['point'].apply(list)
label
0 [(21.157229, -76.575249), (21.157453, -76.5750...
1 [(47.645914, -113.492501), (47.643609, -113.49...
2 [(46.67210037270342, 4.380376578722878), (46.5...
3 [(85.34030732681661, 23.393948586534073), (86....
4 [(81.40203846660347, 16.697291990770392), (82....
...
93 [(61.419880354359925, 23.25522624430636), (61....
94 [(50.893415175135424, 90.70863269095085), (52....
95 [(88.80586950148697, 81.17523712192651), (88.6...
96 [(34.23624333000541, 40.8156668231013), (35.86...
97 [(16.10456828199399, 67.41443008931344), (15.9...
Name: point, Length: 98, dtype: object
Although you'd probably need to do some kind of sorting to make sure you were connecting the closest points when drawing the polygons.
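If convex outlines are good enough, one way to avoid the sorting step entirely is to take the convex hull of each cluster. A minimal sketch, assuming the subgdf built above (noise, label -1, is already filtered out); with min_samples=4 every cluster has at least four points, so each hull comes out as a polygon unless the points happen to be collinear:

import geopandas as gpd
from shapely.geometry import MultiPoint

# one convex-hull polygon per DBSCAN cluster label
hulls = subgdf.groupby("label").apply(
    lambda g: MultiPoint(list(zip(g["longitude"], g["latitude"]))).convex_hull
)
poly_gdf = gpd.GeoDataFrame({"label": hulls.index}, geometry=list(hulls.values))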
Similar SO question
DBSCAN from sklearn
Haversine Formula in Python (Bearing and Distance between two GPS points)
https://gis.stackexchange.com/questions/121256/creating-a-circle-with-radius-in-metres
You may be able to use the haversine formula to group points within a distance. Create a circular polygon around each point with the function below, filter the points that fall inside it out of the master list, and repeat until there are no more points (see the sketch after the code below).
#import modules
import numpy as np
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame, GeoSeries
from shapely import geometry
from shapely.geometry import Polygon, Point
from functools import partial
import pyproj
from shapely.ops import transform
#function to create polygons on radius
def polycir(lat, lon, radius):
    local_azimuthal_projection = "+proj=aeqd +R=6371000 +units=m +lat_0={} +lon_0={}".format(lat, lon)
    wgs84_to_aeqd = partial(
        pyproj.transform,
        pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
        pyproj.Proj(local_azimuthal_projection),
    )
    aeqd_to_wgs84 = partial(
        pyproj.transform,
        pyproj.Proj(local_azimuthal_projection),
        pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
    )
    center = Point(float(lon), float(lat))
    point_transformed = transform(wgs84_to_aeqd, center)
    buffer = point_transformed.buffer(radius)
    # Get the polygon with lat lon coordinates
    circle_poly = transform(aeqd_to_wgs84, buffer)
    return circle_poly
#Convert df to gdf
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude))
#Create circle polygons col
gdf['polycir'] = [polycir(x, y, <'Radius in Meters'>) for x, y in zip(gdf.latitude, gdf.longitude)]
gdf.set_geometry('polycir', inplace=True)
#You should be able to loop through the polygons and find the geometries that overlap with
# gdf_filtered = gdf[gdf.polycir.within(gdf.iloc[0,4])]
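Roughly, the repeat-until-empty filtering described above could look like this sketch (an untested outline; it assumes the gdf built above, with the point geometries in the 'geometry' column and the circles in 'polycir'):

# iteratively peel off groups: take the first remaining point, collect every point
# that falls inside that point's circle, store the group, drop it, and repeat
remaining = gdf.copy()
groups = []
while not remaining.empty:
    circle = remaining.iloc[0]["polycir"]
    in_circle = remaining["geometry"].within(circle)
    groups.append(remaining[in_circle])
    remaining = remaining[~in_circle]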
Looks like a job for k-means clustering.
You may need to be careful about how you define your distance (actual distance "through" the earth, or shortest path around the surface?).
Turning each cluster into a polygon depends on what you want to do... just chain the points or look for their convex envelope...
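A minimal sketch of the clustering step with scikit-learn's KMeans; the number of clusters k is an assumption you have to pick up front (unlike DBSCAN), and plain lon/lat Euclidean distance is used here for simplicity:

from sklearn.cluster import KMeans

k = 3  # assumption: k-means needs the number of clusters chosen in advance
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(gdf[["longitude", "latitude"]].values)
gdf["cluster"] = km.labels_

From there, each cluster's points can be turned into a polygon, e.g. with the convex-hull approach sketched further up.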
Using pandas and geopandas, I would like to define a function to be applied to each row of a dataframe which operates as follows:
INPUT: column with coordinates
OUTPUT: zone in which the point falls.
I tried this, but it takes a very long time.
def zone_assign(point, zones, codes):
    try:
        zone_label = zones[zones['geometry'].contains(point)][codes].values[0]
    except:
        zone_label = np.NaN
    return zone_label
where:
point is the cell of the row which contains geographical coordinates;
zones is the shapefile imported with geopandas;
codes is the column of the shapefile which contains the label to be assigned to the point.
Part of this answer is taken from another answer I wrote earlier that needed within rather than contains.
Your situation looks like a typical case where spatial joins are useful. The idea of spatial joins is to merge data using geographic coordinates instead of using attributes.
Three possibilities in geopandas:
intersects
within
contains
It seems like you want contains, which is possible using the following syntax:
geopandas.sjoin(polygons, points, how="inner", op='contains')
Note: You need to have installed rtree to be able to perform such operations. If you need to install this dependency, use pip or conda to install it
Example
As an example, let's take a random sample of cities and plot the countries they belong to. The two example datasets are:
import geopandas
import matplotlib.pyplot as plt
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
cities = geopandas.read_file(geopandas.datasets.get_path('naturalearth_cities'))
cities = cities.sample(n=50, random_state=1)
world.head(2)
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Oceania Fiji FJI 8374.0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 53950935 Africa Tanzania TZA 150600.0 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
cities.head(3)
name geometry
196 Bogota POINT (-74.08529 4.59837)
95 Tbilisi POINT (44.78885 41.72696)
173 Seoul POINT (126.99779 37.56829)
world is a worldwide dataset and cities is a subset.
Both datasets need to be in the same projection system. If not, use .to_crs before merging.
data_merged = geopandas.sjoin(world, cities, how="inner", op='contains')
Finally, to see the result let's do a map
f, ax = plt.subplots(1, figsize=(20,10))
data_merged.plot(ax=ax)
world.plot(ax=ax, alpha=0.25, linewidth=0.1)
plt.show()
and the underlying dataset merges together the information we need
data_merged.head(2)
pop_est continent name_left iso_a3 gdp_md_est geometry index_right name_right
7 6909701 Oceania Papua New Guinea PNG 28020.0 MULTIPOLYGON (((141.00021 -2.60015, 142.73525 ... 59 Port Moresby
9 44293293 South America Argentina ARG 879400.0 MULTIPOLYGON (((-68.63401 -52.63637, -68.25000... 182 Buenos Aires
Here, I used the inner join method, but that's a parameter you can change if, for instance, you want to keep all points, including those not within a polygon.
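For example, to keep every city even when it falls outside all country polygons, a quick sketch flipping the join around and using a left join (unmatched cities get NaN in the country columns):

# every city is kept; cities not within any country get NaN country attributes
all_cities = geopandas.sjoin(cities, world, how="left", op="within")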
I'm trying to add weights to my folium heatmap layer, but I can't figure out how to correctly implement this.
I have a dataframe with 3 columns: LAT, LON and VALUE. Value being the total sales of that location.
self.map = folium.Map([mlat, mlon], tiles=tiles, zoom_start=8)
locs = zip(self.data.LAT, self.data.LON, self.data.VALUE)
HeatMap(locs, radius=30, blur=10).add_to(self.map)
I tried to use the absolute sales values and I also tried to normalize sales/sales.sum(). Both give me similar results.
The problem is:
The heatmap shows stronger red levels for regions with more stores, even if the total sales of those stores combined are a lot smaller than the sales of a distant, isolated large store.
Expected behaviour:
I would expect that the intensity of the heatmap should use the value of sales of each store, as sales was passed in the zip object to the HeatMap plugin.
Let's say I have 2 regions: A and B.
In region A I have 3 stores: 10 + 15 + 10 = 35 total sales.
In region B I have 1 big store: 100 total sales
I'd expect a greater intensity for region B than for region A. I noticed that a similar behaviour only occurs when the difference is very large (if I try 35 vs 5000000 then region B becomes more relevant).
My CSV file is just a random sample, like this:
LAT,LON,VALUE,DATE,DIFFLAT1,DIFFLON1
-22.4056,-53.6193,14,2010,0.0242,0.4505
-22.0516,-53.7025,12,2010,0.3137,0.6636
-22.3239,-52.9108,100,2010,0.0514,0.0002
-22.6891,-53.7424,6,2010,0.0002,0.7887
-21.8762,-53.6866,16,2010,0.7283,0.6180
-22.1861,-53.5353,11,2010,0.1420,0.2924
import folium
from folium import plugins
from folium.plugins import HeatMap

# df is assumed to have the columns lat, lon and weight
heat_df = df.loc[:, ["lat", "lon", "weight"]]
map_hooray = folium.Map(location=[45.517999, -73.568184], zoom_start=12)
# Format: list of [lat, lon, weight] lists
heat_data = heat_df.values.tolist()
# Plot it on the map
HeatMap(heat_data, radius=13).add_to(map_hooray)
# Save the map
map_hooray.save('heat_map.html')
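If the weights span very different magnitudes, it may also help to scale the weight column into the 0-1 range before building the list. A sketch using the column names from the question (LAT, LON, VALUE); the dataframe name df is an assumption:

# scale VALUE to [0, 1] so a single very large store doesn't dominate the colour scale
heat_df = df.loc[:, ["LAT", "LON", "VALUE"]].copy()
heat_df["VALUE"] = heat_df["VALUE"] / heat_df["VALUE"].max()
HeatMap(heat_df.values.tolist(), radius=13).add_to(map_hooray)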