Counting points in overlapping polygons with GeoPandas

Counting points in overlapping polygons with GeoPandas - python

I have converted a 200m x 200m point grid of Greater London into a multypolygon 500m radius buffer layer for each point in the grid. What this means is that I have over 100,000 overlapping polygons.
I also have a years worth of crime data as a point layer with lat longs (over 1.1million crimes x 12 columns of data)
I am trying to find the most efficient way to count the number of crime points in each polygon buffer. As the polygon buffers are overlapping the crime points will overlap too for all of the buffers.
The spatial join in geopandas doesn't seem to work, maybe because the polygons are overlapping? If I use "inner" join I just get a blank dataframe back. If I use "left" join then I just get all the crime rows (1.1million) with the buffer polygon columns to the right all as "nan". And vice versa if I use "right" join - just the buffer rows (100,000) with crime columns as nan. See the code below:
import pandas as pd
import geopandas as gpd
from geopandas import read_file
from pandas import read_csv
from geopandas import GeoDataFrame, points_from_xy
#import buffer polygon layer
gBuffer = read_file('London Buffer.zip')
df1 = gBuffer.head()
#import crime csv
crime = read_csv('2020-2021 London Crime.csv')
#drop nan rows from coords
crime2 = crime[crime['Longitude'].notna()]
df2 = crime2.head()
#geocode crime points
gCrime = GeoDataFrame(crime2, geometry=points_from_xy(crime2['Longitude'], crime2['Latitude']))
df3 = gCrime.head()
#set equal crs
gCrime.crs = gBuffer.crs
#spatial join data
BufferCrime = gpd.sjoin(gCrime, gBuffer, how="inner")
The other solution is to iterate over each polygon and count the number of points but this will take forever given that it has to do 100,000 x 1,100,000 iterations
# Loop over polygons with index i.
for i, poly in gBuffer.iterrows():
#list of points in this poly
pts_in_this_poly = []
#loop over all points
for j, pt in gCrime.iterrows():
if poly.geometry.contains(pt.geometry):
# Add it to the list
pts_in_this_poly.append(pt.geometry)
pts_in_polys.append(len(pts_in_this_poly))
#Add the points
gBuffer['number of Crime points'] = gpd.GeoSeries(pts_in_polys)
Any ideas what would be the best way to solve this problem?

Related

"leak" converting .csv to .nc using xarray in some points

I'm trying to transform some points that are tabulated .csv in a netcdf file.
This is my .csv file: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
In my spreadsheet, I have the unique location of each point, not regular for all area but points are spaced by 0.1 degree, an SP value per year up to 100 years forward.
To work with this data, I needed something like other sources that use netcdf data tabled in sp(time, lat, lon). So, I can evaluate and visualize the values of this specific region by year (using panoply or ncview for example).
For that, I came up with this code:
import pandas as pd
import xarray as xr
import numpy as np
csv_file = 'example.csv'
df = pd.read_csv(csv_file)
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time']= pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
xr = df.to_xarray()
xc = xr.fillna(0)
xc.to_netcdf(csv_file + '.nc')
And I got a netcdf file like this: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
At first, my code seems to work and create my netcdf file without problems, however, I noticed that in some places I am creating some "leakage" of points, or interpolating the same values in some direction (north-south and west-east) when it shouldn't happen.
If you do a simple plot before converting to xarray you can see there are 3 west segments and one south segment
xr.sp[0].plot()
And this ends up being masked a bit when I fill the NaN with 0 and plot it again:
xc.sp[0].plot()
Checking the netcdf file using panoply I got something similar as well:
So I've start to check every-step of my code to see if I miss something.. my first guess was the melt part but I not 100% sure because if I plot df I can't see any leaking or extrapolation in the same region:
joint_axes = seaborn.jointplot(
x="lon", y="lat", data=df, s=0.5
)
contextily.add_basemap(
joint_axes.ax_joint,
crs="EPSG:4326",
source=contextily.providers.CartoDB.PositronNoLabels,
);
So anyone have any idea what's happening here?
EDIT:
Now a solution that would help me at the moment would be to fill in the missing coordinates with a value equal to 0 within my domain area using the minimum and maximum latitudes and longitudes.
My first (and unconventional) idea was to create a 0.1 x 0.1 grid with values equal to zero and feed this grid with my existing values.
However, the method using reindex would help me and I would be able to execute it in a few lines. My doubt is whether I should do this before or after the df.melt in my code.
I'm in this situation:
csv_file = '/Users/helioguerraneto/Desktop/example.csv'
df = pd.read_csv(csv_file)
lonmin, lonmax = df['lon'].min(), df['lon'].max()
latmin, latmax = df['lat'].min(), df['lat'].max()
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time']= pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
xr = df.to_xarray()
xc = xr.reindex(lat=np.arange(latmin, latmax, 0.1), lon=np.arange(lonmin, lonmax, 0.1), fill_value=0)
xc.to_netcdf(csv_file + '.nc')
Seems like reindex is the way but I need to keep original data. I was expecting some zeros but not in all area:
EDIT2:
I think I found something might help! My goal now could be same what's happing here: How to interpolate latitude/longitude and heading in Pandas
But instead of interpolation by the nearest I just could match with the exactly coordinates. Maybe the real problem here is mix 100 hundred grids in the end..
Any suggestions?

Create polygon from list of points only if they are nearby

I have a list of points (longitude and latitude), as well as their associated point geometries in a geodataframe. All of the points should be able to be subdivided into individual polygons, as the points are generally clustered in several areas. What I would like to do is have some sort of algorithm that loops over the points and checks the the distance between the previous and current point. If the distance is sufficiently small, it would group those points together. This process would continue until the current point is too far away. It would make a polygon out of those close points, and then continue the process with the next group of points.
gdf
longitude latitude geometry
0 -76.575249 21.157229 POINT (-76.57525 21.15723)
1 -76.575035 21.157453 POINT (-76.57503 21.15745)
2 -76.575255 21.157678 POINT (-76.57526 21.15768)
3 -76.575470 21.157454 POINT (-76.57547 21.15745)
5 -112.973177 31.317333 POINT (-112.97318 31.31733)
... ... ... ...
2222 -113.492501 47.645914 POINT (-113.49250 47.64591)
2223 -113.492996 47.643609 POINT (-113.49300 47.64361)
2225 -113.492379 47.643557 POINT (-113.49238 47.64356)
2227 -113.487443 47.643142 POINT (-113.48744 47.64314)
2230 -105.022627 48.585669 POINT (-105.02263 48.58567)
So in the data above, the first 4 points would be grouped together and turned into a polygon. Then, it would move onto the next group, and so forth. Each group of points is not evenly spaced, i.e., the next group might be 7 pairs of points, and the following could be 3. Ideally, the final output would be another geodataframe that is just a bunch of polygons.

You can try DBSCAN clustering as it will automatically find the best number of clusters and you can specify a maximum distance between points ( ε ).
Using your example, the algorithm identifies two clusters.
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.DataFrame(
[
[-76.575249, 21.157229, (-76., 21.15723)],
[-76.575035, 21.157453, (-76.57503, 21.15745)],
[-76.575255, 21.157678, (-76.57526, 21.15768)],
[-76.575470, 21.157454, (-76.57547, 21.15745)],
[-112.973177, 31.317333, (-112.97318, 31.31733)],
[-113.492501, 47.645914, (-113.49250, 47.64591)],
[-113.492996, 47.643609, (-113.49300, 47.64361)],
[-113.492379, 47.643557, (-113.49238, 47.64356)],
[-113.487443, 47.643142, (-113.48744, 47.64314)],
[-105.022627, 48.585669, (-105.02263, 48.58567)]
], columns=["longitude", "latitude", "geometry"])
clustering = DBSCAN(eps=0.3, min_samples=4).fit(df[['longitude','latitude']].values)
gdf = pd.concat([df, pd.Series(clustering.labels_, name='label')], axis=1)
print(gdf)
gdf.plot.scatter(x='longitude', y='latitude', c='label')
longitude latitude geometry label
0 -76.575249 21.157229 (-76.0, 21.15723) 0
1 -76.575035 21.157453 (-76.57503, 21.15745) 0
2 -76.575255 21.157678 (-76.57526, 21.15768) 0
3 -76.575470 21.157454 (-76.57547, 21.15745) 0
4 -112.973177 31.317333 (-112.97318, 31.31733) -1 # not in cluster
5 -113.492501 47.645914 (-113.4925, 47.64591) 1
6 -113.492996 47.643609 (-113.493, 47.64361) 1
7 -113.492379 47.643557 (-113.49238, 47.64356) 1
8 -113.487443 47.643142 (-113.48744, 47.64314) 1
9 -105.022627 48.585669 (-105.02263, 48.58567) -1 # not in cluster
If we add random data to your data set, run the clustering algorithm, and filter out those data points not in clusters, you get a clearer idea of how it's working.
import numpy as np
rng = np.random.default_rng(seed=42)
arr2 = pd.DataFrame(rng.random((3000, 2)) * 100, columns=['latitude', 'longitude'])
randdf = pd.concat([df[['latitude', 'longitude']], arr2]).reset_index()
clustering = DBSCAN(eps=1, min_samples=4).fit(randdf[['longitude','latitude']].values)
labels = pd.Series(clustering.labels_, name='label')
gdf = pd.concat([randdf[['latitude', 'longitude']], labels], axis=1)
subgdf = gdf[gdf['label']> -1]
subgdf.plot.scatter(x='longitude', y='latitude', c='label', colormap='viridis', figsize=(20,10))
print(gdf['label'].value_counts())
-1 2527
16 10
3 8
10 8
50 8
...
57 4
64 4
61 4
17 4
0 4
Name: label, Length: 99, dtype: int64
Getting the clustered points from this dataframe would be relatively simple. Something like this:
subgdf['point'] = subgdf.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
subgdf.groupby(['label'])['point'].apply(list)
label
0 [(21.157229, -76.575249), (21.157453, -76.5750...
1 [(47.645914, -113.492501), (47.643609, -113.49...
2 [(46.67210037270342, 4.380376578722878), (46.5...
3 [(85.34030732681661, 23.393948586534073), (86....
4 [(81.40203846660347, 16.697291990770392), (82....
...
93 [(61.419880354359925, 23.25522624430636), (61....
94 [(50.893415175135424, 90.70863269095085), (52....
95 [(88.80586950148697, 81.17523712192651), (88.6...
96 [(34.23624333000541, 40.8156668231013), (35.86...
97 [(16.10456828199399, 67.41443008931344), (15.9...
Name: point, Length: 98, dtype: object
Although you'd probably need to do some kind of sorting to make sure you were connecting the closest points when drawing the polygons.
Similar SO question
DBSCAN from sklearn

Haversine Formula in Python (Bearing and Distance between two GPS points)
https://gis.stackexchange.com/questions/121256/creating-a-circle-with-radius-in-metres
You may be able to use the haversine formula to group points within a distance. Create polygons for each point (function below) with the formula then filter points inside from the master list of points and repeat until there are no more points.
#import modules
import numpy as np
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame, GeoSeries
from shapely import geometry
from shapely.geometry import Polygon, Point
from functools import partial
import pyproj
from shapely.ops import transform
#function to create polygons on radius
def polycir(lat, lon, radius):
local_azimuthal_projection = """+proj=aeqd +R=6371000 +units=m +lat_0={} +lon_0=
{}""".format(lat, lon)
wgs84_to_aeqd = partial(
pyproj.transform,
pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
pyproj.Proj(local_azimuthal_projection),
)
aeqd_to_wgs84 = partial(
pyproj.transform,
pyproj.Proj(local_azimuthal_projection),
pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
)
center = Point(float(lon), float(lat))
point_transformed = transform(wgs84_to_aeqd, center)
buffer = point_transformed.buffer(radius)
# Get the polygon with lat lon coordinates
circle_poly = transform(aeqd_to_wgs84, buffer)
return circle_poly
#Convert df to gdf
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude))
#Create circle polygons col
gdf['polycir'] = [polycir(x, y, <'Radius in Meters'>) for x, y in zip(gdf.latitude,
gdf.longitude)]
gdf.set_geometry('polycir', inplace=True)
#You should be able to loop through the polygons and find the geometries that overlap with
# gdf_filtered = gdf[gdf.polycir.within(gdf.iloc[0,4])]

Looks like a job for k-means clustering.
You may need to be careful to how you define your distance (actual disctance "through" earth, or shortest path around?)
Turning each cluster into a polygon depends on what you want to do... just chain them or look for their convex enveloppe...

compute and plot monthly mean SST anomalies and plot with xarray multindex (pangeo tutorial gallery)

I'm working through the pangeo tutorial gallery and am stuck on the ENSO exercise at the end of xarray
you'll need to download some files:
%%bash
git clone https://github.com/pangeo-data/tutorial-data.git
Then:
import numpy as np
import xarray as xr
import pandas as pd
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
# subset years to match hint at the bottom
sst_enso = sst_enso.sel(time=sst_enso.time.dt.year>=1982)
# groupby each timepoint and find mean for entire spatial region
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
This figure matches that shown at the bottom of the tutorial. so far so good, but i'd like to compute and plot ONI as well. Warm or cold phases of the Oceanic Nino Index are defined by a five consecutive 3-month running mean of sea surface temperature (SST) anomalies in the Niño 3.4 region that is above (below) the threshold of +0.5°C (-0.5°C). This is known as the Oceanic Niño Index (ONI).
I run into trouble because the month becomes an index.
Q1. I'm not sure how to make sure that subtracting sst_enso - enso_clim results in the correct math.
Assuming that is correct, I can compute the regional mean anomaly again and then use a rolling window mean.
enso_clim = sst_enso.sst.groupby('time.month').mean('time')
sst_anom = sst_enso - enso_clim
enso_anom = sst_anom.groupby('time').mean(dim=['lat','lon'])
oni = enso_anom.rolling(time = 3).mean()
Now I'd like to plot a bar chart of oni with positive red, negative blue. Something like this:
for exaample with:
oni.sst.plot.bar(color=(oni.sst < 0).map({True: 'b', False: 'r'}))
Instead oni.sst.plot() gives me:
Resetting the index enso_anom.reset_index('month', drop=True).sst still keeps month as a dimension and gives the same plot. If you drop_dims('month') then the sst data goes away.
I also tried converting to a pd with oni.to_dataframe() but you end up with 5040 rows which is 12 months x 420 month-years I subsetted for. According to the docs "The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex)." so I guess that makes sense, but not useful. Even if you reset_index of oni before converting to a dataframe you get the same 5040 rows. Q2. Since the dataframe must be repeating itself I can probably figure out where, but is there a way to do this "cleaner" with each date not repeated for all 12 months?

Your code results into an DataArray with the dimensions time and month due to the
re-chunking. This is the reason why you end up with such a plot.
There is a trick (found here) to calculate anomalies. Besides this I would select as a reference period 1986-2015 ( see NOAA definition for ONI-index).
Combining both I ended up in this short code (without the bar plots):
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
ds = sst_enso.sst.mean(dim=['lat','lon'])
enso_clim = ds.sel(time=slice('1986-01-01', '2016-01-01')).groupby("time.month").mean("time")
# ref: https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_change.shtml
enso_anom = ds.groupby("time.month") - enso_clim
# ref: http://xarray.pydata.org/en/stable/examples/weather-data.html#Calculate-monthly-anomalies
enso_anom.plot()
oni = enso_anom.rolling(time = 3).mean()
oni.plot()

Find in what polygon is each point

I am new to Python, so I apologize for the rudimentary programming skills, I am aware I am using a bit too much "loop for" (coming from Matlab it is dragging me down).
I have millions of points (timestep, long, lat, pointID) and hundreds of irregular non-overlapping polygons (vertex_long,vertex_lat,polygonID).points and polygons format sample
I want to know what polygon contains each point.
I was able to do it this way:
from matplotlib import path
def inpolygon(lon_point, lat_point, lon_poly, lat_poly):
shape = lon_point.shape
lon_point = lon_point.reshape(-1)
lat_point = lat_point.reshape(-1)
lon_poly = lon_poly.values.reshape(-1)
lat_poly = lat_poly.values.reshape(-1)
points = [(lon_point[i], lat_point[i]) for i in range(lon_point.shape[0])]
polys = path.Path([(lon_poly[i], lat_poly[i]) for i in range(lon_poly.shape[0])])
return polys.contains_points(points).reshape(shape)
And then
import numpy as np
import pandas as pd
Areas_Lon = Areas.iloc[:,0]
Areas_Lat = Areas.iloc[:,1]
Areas_ID = Areas.iloc[:,2]
Unique_Areas = np.unique(Areas_ID)
Areas_true=np.zeros((Areas_ID.shape[0],Unique_Areas.shape[0]))
for i in range(Areas_ID.shape[0]):
for ii in range(Unique_Areas.shape[0]):
Areas_true[i,ii]=(Areas_ID[i]==Unique_Areas[ii])
Areas_Lon_Vertex=np.zeros(Unique_Areas.shape[0],dtype=object)
Areas_Lat_Vertex=np.zeros(Unique_Areas.shape[0],dtype=object)
for i in range(Unique_Areas.shape[0]):
Areas_Lon_Vertex[i]=(Areas_Lon[(Areas_true[:,i]==1)])
Areas_Lat_Vertex[i]=(Areas_Lat[(Areas_true[:,i]==1)])
import f_inpolygon as inpolygon
Areas_in=np.zeros((Unique_Areas.shape[0],Points.shape[0]))
for i in range (Unique_Areas.shape[0]):
for ii in range (PT.shape[0]):
Areas_in[i,ii]=(inpolygon.inpolygon(Points[ii,2], Points[ii,3], Areas_Lon_Vertex[i], Areas_Lat_Vertex[i]))
This way the final outcome Areas_in Areas_in format contains as many rows as polygons and as many columns as points, where every column is true=1 at the row where the point is relative to polygon index (1st given polygon ID --> 1st row, and so).
The code works but very slowly for what it is supossed to do. When locating points in a regular grid or within a point radius I have succesfully tried implement a KDtree, what increases dramatically the speed, but I can`t do the same or whatever faster to irregular non-overlapping polygons.
I have seen some related questions but rather than asking for what polygons a point is were about whether a point is inside a polygon or not.
Any idea please?

Have you tried Geopandas Spatial join?
install the Package using pip
pip install geopandas
or conda
conda install -c conda-forge geopandas
then you should able to read the data as GeoDataframe
import geopandas
df = geopandas.read_file("file_name1.csv") # you can read shp files too.
right_df = geopandas.read_file("file_name2.csv") # you can read shp files too.
# Convert into geometry column
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])] # Coordinate reference system : WGS84
crs = {'init': 'epsg:4326'}
# Creating a Geographic data frame
left_df = geopandas.GeoDataFrame(df, crs=crs, geometry=geometry)
Then you can apply the sjoin
jdf = geopandas.sjoin(left_df, right_df, how='inner', op='intersects', lsuffix='left', rsuffix='right')
the option in op are:
intersects
contains
within
All should do the same in your case when you joining two geometry columns of type Polygon and Point

Python shapely: Aggregating points to shape files for a Choropleth map

I'm trying a create a Choropleth in Python3 using shapely, fiona & bokeh for display.
I have a file with about 7000 lines that have the location of a town and a counter.
Example:
54.7604;9.55827;208
54.4004;9.95918;207
53.8434;9.95271;203
53.5979;10.0013;201
53.728;10.2526;197
53.646;10.0403;196
54.3977;10.1054;193
52.4385;9.39217;193
53.815;10.3476;192
...
I want to show these in a 12,5km grid, for which a shapefile is available on
https://opendata-esri-de.opendata.arcgis.com/datasets/3c1f46241cbb4b669e18b002e4893711_0
The code I have works.
It's very slow, because it's a brute force algorithm that checks each of the 7127 grid points against all of the 7000 points.
import pandas as pd
import fiona
from shapely.geometry import Polygon, Point, MultiPoint, MultiPolygon
from shapely.prepared import prep
sf = r'c:\Temp\geo_de\Hexagone_125_km\Hexagone_125_km.shp'
shp = fiona.open(sf)
district_xy = [ [ xy for xy in feat["geometry"]["coordinates"][0]] for feat in shp]
district_poly = [ Polygon(xy) for xy in district_xy] # coords to Polygon
df_p = pd.read_csv('points_file.csv', sep=';', header=None)
df_p.columns = ('lat', 'lon', 'count')
map_points = [Point(x,y) for x,y in zip(df_p.lon, df_p.lat)] # Convert Points to Shapely Points
all_points = MultiPoint(map_points) # all points
def calc_points_per_poly(poly, points, values): # Returns total for poly
poly = prep(poly)
return sum([v for p, v in zip(points, values) if poly.contains(p)])
# this is the slow part
# for each shape this sums um the points
sum_hex = [calc_points_per_poly(x, all_points, df_p['count']) for x in district_poly]
Since this is extremly slow, I'm wondering if there is a faster way to get the num_hex value, especially, since the real world list of points may be a lot larger and a smaller grid with more shapes would deliver a better result.

I would recommend using 'geopandas' and its built-in rtree spatial index. It allows you to do the check only if there is a possibility that point lies within polygon.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon, Point
sf = 'Hexagone_125_km.shp'
shp = gpd.read_file(sf)
df_p = pd.read_csv('points_file.csv', sep=';', header=None)
df_p.columns = ('lat', 'lon', 'count')
gdf_p = gpd.GeoDataFrame(df_p, geometry=[Point(x,y) for x,y in zip(df_p.lon, df_p.lat)])
sum_hex = []
spatial_index = gdf_p.sindex
for index, row in shp.iterrows():
polygon = row.geometry
possible_matches_index = list(spatial_index.intersection(polygon.bounds))
possible_matches = gdf_p.iloc[possible_matches_index]
precise_matches = possible_matches[possible_matches.within(polygon)]
sum_hex.append(sum(precise_matches['count']))
shp['sum'] = sum_hex
This solution should be faster than your. You can then plot your GeoDataFrame via Bokeh. If you want more details on spatial indexing I recommend this article by Geoff Boeing: https://geoffboeing.com/2016/10/r-tree-spatial-index-python/

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.