Is it possible to get counts of intersections between two geometries using GeoPandas objects? That is, I want to count up the number of polygons or line strings in one GeoDataFrame that intersect with each polygon in another GeoDataFrame. I did not see an easy way of doing this while browsing the GeoPandas docs, but wanted to check before moving on to lower-level tools.
You want a spatial join: geopandas.tools.sjoin().
There's an example in this Jupyter Notebook — look at the section called Spatial join. It counts a set of points (midpoints) into a set of polygons (bins). Both sets of geometries are stored in GeoDataFrames.
At the time of writing, tools.sjoin() is not in the current release of geopandas. I couldn't get geopandas.tools to build in any of their branches, but I fixed it — for me anyway — in my fork. My fix is an open PR.
I don't know about a built-in tool for this, but I'm no expert. That said, it's easily done with a little pandas magic:
import geopandas as gpd
import numpy as np
from shapely.geometry import Point, Polygon

# three points and one polygon that contains all of them
p1 = Point(.5, .5)
p2 = Point(.5, 1)
p3 = Point(1, 1)
poly = Polygon([(0, 0), (0, 2), (2, 2), (2, 0)])

df1 = gpd.GeoSeries([p1, p2, p3])
df2 = gpd.GeoDataFrame([poly, p3], columns=['geometries'])

# for each geometry in df2, count how many geometries in df1 intersect it
f = lambda x: np.sum(df1.intersects(x))
df2['geometries'].apply(f)
Should return:
0 3
1 1
Name: geometries, dtype: int64
Let's consider two geometries (points and polygons) which intersect at least once.
Spatial join of your layers
You should write something like this:
pointsInPolygon = gpd.sjoin(points, polygons, how="inner", op='intersects')
Add a field with 1 as a constant value
You should write something like this: pointsInPolygon['const'] = 1
Group by the column on which you want to aggregate the data
You should write something like this: pointsInPolygon.groupby(['field']).sum()
The const column will give you the count of intersections between your two geometries.
If you want to keep the other columns as well, just type something like this: df = pointsInPolygon.groupby('field').agg({'columnA': 'first', 'columnB': 'first', 'const': 'sum'}).reset_index()
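Putting the three steps together, a minimal sketch (points, polygons, and the grouping column field are the placeholders from the steps above):

import geopandas as gpd

# 1. spatial join: one row per (point, polygon) pair that intersects
pointsInPolygon = gpd.sjoin(points, polygons, how="inner", op='intersects')

# 2. constant field to sum
pointsInPolygon['const'] = 1

# 3. group on the polygon identifier to count intersections per polygon
counts = pointsInPolygon.groupby(['field'])['const'].sum()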
I have two GeoDataFrames.
gdf_point:
Unnamed: 0 latitude longitude geometry
0 0 50.410203 7.236583 POINT (7.23658 50.41020)
1 1 51.303545 7.263082 POINT (7.26308 51.30354)
2 2 50.114965 8.672785 POINT (8.67278 50.11496)
and gdf_poly:
Unnamed: 0 Id geometry
0 0 301286 POLYGON ((9.67079 49.86762, 9.67079 49.86987, ...
1 1 302258 POLYGON ((9.67137 54.75650, 9.67137 54.75874, ...
2 2 302548 POLYGON ((9.66808 48.21535, 9.66808 48.21760, ...
I want to check whether a point from gdf_point is contained by any of the polygons of gdf_poly; if so, I want the Id of that polygon to be added to the corresponding row of gdf_point.
Here is my current code:
COUNTER = 0

def f(x, gdf_poly, df_new_point):
    global COUNTER
    for row in gdf_poly.itertuples():
        geom = getattr(row, 'geometry')
        id = getattr(row, 'Id')
        if geom.contains(x):
            print('True')
            df_new_point.loc[COUNTER, 'Id'] = id
    COUNTER = COUNTER + 1

df_new_point = gdf_point
gdf_point['geometry'].apply(lambda x: f(x, gdf_poly, df_new_point))
This works and does what I want it to do. The problem is that it's way too slow: it takes about 50 minutes for 10k rows (multithreading is a future option), and I want it to be able to handle several million rows. There must be a better and faster way to do this. Thanks for your help.
To merge two dataframes on their geometries (not on column or index values), use one of geopandas's spatial joins. They have a whole section of the docs about it - it's great - give it a read!
There are two workhorse spatial join functions in geopandas:
GeoDataFrame.sjoin joins two dataframes based on a binary predicate performed on all combinations of geometries, one of intersects, contains, within, touches, crosses, or overlaps. You can specify whether you want a left, right, or inner join with the how keyword argument.
GeoDataFrame.sjoin_nearest joins two dataframes based on which geometry in one dataframe is closest to each element in the other. Similarly, the how argument gives left, right, and inner options. Additionally, there are two arguments to sjoin_nearest not available on sjoin:
max_distance: The max_distance argument specifies a maximum search radius for matching geometries. This can have a considerable performance impact in some cases. If you can, it is highly recommended that you use this parameter.
distance_col: If set, the resultant GeoDataFrame will include a column with this name containing the computed distances between an input geometry and the nearest geometry.
You can optionally use these global geopandas.sjoin and geopandas.sjoin_nearest functions, or use the methods geopandas.GeoDataFrame.sjoin and geopandas.GeoDataFrame.sjoin_nearest. Note, however, that the docs include a warning that the root-level functions may be deprecated at some point in the future, and recommend the use of the GeoDataFrame methods.
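For illustration, a minimal sjoin_nearest sketch using the frames from the question (the max_distance value and the dist column name are arbitrary choices):

# join each point to its nearest polygon, searching at most 0.1 degrees away,
# and record the computed distance in a new "dist" column
nearest = gdf_point.sjoin_nearest(
    gdf_poly,
    how="left",
    max_distance=0.1,   # hypothetical search radius
    distance_col="dist",
)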
So in your case:
merged = gdf_poly.sjoin(gdf_point, predicate="contains")
will do the trick, though if you want to match polygons where the point falls exactly on the boundary, you may want to consider predicate="intersects".
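And if you'd rather keep the result indexed by the points and just pull the polygon Id across, a sketch joining from the point side (assuming each point falls in at most one polygon):

# one row per point, with the matching polygon's Id attached
# (NaN where no polygon contains the point)
matched = gdf_point.sjoin(gdf_poly, how="left", predicate="within")
gdf_point["Id"] = matched["Id"]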
Initially, I have 2 datasets. One is a dataset with 45 polygons defined in Excel, and the other is geometric coordinates of points. I need to know, for each point, in which of the 45 polygons it is located.
For the polygons, I have a csv file which records POLYGON(......) objects. I want to later check whether a polygon contains a point with shapely. I thought it was already polygon type, but when I import it from csv, it imports just as a string. I tried to convert this data to Polygon()
Each row in df looks something like this (shortened on purpose):
POLYGON ((37.667289733886719 55.700740814208984,37.670955657958984 55.70050048828125)
As suggested, I also printed the first 5 rows of this dataset:
print(io.head(5))
WKT IO_ID Unnamed: 2
0 POLYGON ((37.667289733886719 55.70074081420898... 28 NaN
1 POLYGON ((37.671272277832031 55.62009048461914... 29 NaN
2 POLYGON ((37.713523864746094 55.77525711059570... 24 NaN
3 POLYGON ((37.700267791748047 55.72071075439453... 25 NaN
4 POLYGON ((37.783447265625 55.648544311523438,3... 26 NaN
And if I check the datatypes of the columns, the polygon column has object dtype:
df.dtypes
WKT object
IO_ID int64
Unnamed: 2 float64
dtype: object
for polygon in df.WKT:
    polygon = Polygon(polygon)
And it gives me the error: 'str' object has no attribute '__array_interface__'
I can't work out why this happens or what can be done (I confess I am completely new to geodata). My understanding is that instead of object format I need the data in polygon format, but somehow I can't convert it.
To use the spatial features of geopandas, your shapes need to be geometry type, not strings. You can see what type the objects are using the dtype attribute - you should see something like the following:
In [6]: df.geometry.dtype
Out[6]: <geopandas.array.GeometryDtype at 0x17a0934c0>
If instead the output says something like dtype('O'), then you just have strings and need to convert them to a GeometryArray.
It looks like your shapes are in the "well known text" (aka wkt) format. You can convert a wkt column to a geometry column with geopandas.GeoSeries.from_wkt:
# replace string geometry representations with shapely geometries
df['geometry'] = gpd.GeoSeries.from_wkt(df['WKT'])
# initialize GeoDataFrame with the result
# ('geometry' is the default geometry column name)
gdf = gpd.GeoDataFrame(df)
At this point your GeoDataFrame gdf should have all the spatial features of geopandas, and could be used to join to a GeometryArray of points using geopandas.sjoin. Note that a regular DataFrame of points will need to first be converted into a GeoDataFrame using geopandas.points_from_xy - see e.g. this question for an example.
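For instance, a minimal sketch of that conversion (points_df and its Longitude/Latitude column names are assumptions about your points table):

import geopandas as gpd

# build point geometries from plain x/y columns
points_gdf = gpd.GeoDataFrame(
    points_df,
    geometry=gpd.points_from_xy(points_df["Longitude"], points_df["Latitude"]),
    crs="EPSG:4326",  # assuming the coordinates are lon/lat degrees
)

# now a spatial join against the polygon frame works
joined = gpd.sjoin(points_gdf, gdf, predicate="within")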
I have a csv file with a table that has the columns Longitude, Latitude, and Wind Speed. I have a code that takes a csv file and deletes values outside of a specified bound. I would like to retain values whose longitude/latitude is within a 0.5 lon/lat radius of a point located at -71.5 longitude and 40.5 latitude.
My example code below deletes any values whose longitude and latitude isn't between -71 to -72 and 40 to 41 respectively. Of course, this retains values within a square bound ±0.5 lon/lat around my point of interest. But I am interested in finding values within a circular bound with radius 0.5 lon/lat of my point of interest. How should I modify my code?
import pandas as pd

df = pd.read_csv(r"C:\Users\xil15102\Documents\results\EasternLongIsland50.csv")  # file path
indexNames = df[(df['Longitude'] <= -72) | (df['Longitude'] >= -71) | (df['Latitude'] <= 40) | (df['Latitude'] >= 41)].index
df.drop(indexNames, inplace=True)
df.to_csv(r"C:\Users\xil15102\Documents\results\EasternLongIsland50.csv")
Basically you need to check whether a value is within a certain distance of the central point (-71.5, 40.5); to do this, use the Pythagorean theorem/distance formula:
d = sqrt(dx^2+dy^2).
So programmatically, I would do this like:
from math import sqrt

drop_indices = []
for idx in df.index:
    # planar distance from the point of interest at (-71.5, 40.5)
    dx = -71.5 - df.loc[idx, 'Longitude']
    dy = 40.5 - df.loc[idx, 'Latitude']
    if sqrt(dx * dx + dy * dy) > 0.5:
        drop_indices.append(idx)
df = df.drop(drop_indices)
Sorry, that is a sort of ugly way to get rid of the rows, and your way looks much better, but the code should work.
You should write a function to calculate the distance from your point of interest and drop the rows outside it. Some help here. Pretty sure the example below should work if you implement is_not_in_area as a function that calculates the distance and checks whether dist < 0.5.
df = df.drop(df[is_not_in_area(df.lat, df.lon)].index)
(This code lifted from here)
Edit: drop the ones that aren't in the area, not the ones that are haha.
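A minimal sketch of is_not_in_area under that approach (assuming the columns are named Latitude and Longitude, and treating degrees as planar, as the question does):

import numpy as np

def is_not_in_area(lat, lon, center=(40.5, -71.5), radius=0.5):
    # boolean Series: True where the point lies outside the circular bound
    return np.sqrt((lat - center[0]) ** 2 + (lon - center[1]) ** 2) > radius

df = df.drop(df[is_not_in_area(df['Latitude'], df['Longitude'])].index)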
I have 4 Dataframes (ticket_data.csv, providers.csv, stations.csv and cities.csv)
In stations.csv I have two columns called o_city (origin city) and d_city (destination city); those two columns give me the id of the city I need to look for in cities.csv.
In cities.csv I have the lat and long of each city.
How can I calculate the distance between o_city and d_city for each ticket? I tried to use pyproj but I didn't find a way to make it work for each ticket.
[Screenshots of ticket_data.csv and cities.csv omitted]
Welcome to Stack Overflow! In your cities dataframe (assuming here it is called city_df), for each row you can use the haversine formula from spherical geometry to calculate the distance between two coordinate pairs on Earth's surface. Here is some dummy Python 3 code roughly showing how you might go about this (just using two pairs of coordinates for ease of communication):
from haversine import haversine

# column names are placeholders for wherever the coordinates live
distance = haversine(
    (city_df['origin_lat'][0], city_df['origin_lon'][0]),
    (city_df['destination_lat'][0], city_df['destination_lon'][0]),
)
The coordinates must be in decimal degree notation, as in 43.9202, rather than 43° 38' 67" notation. Given this, the output value of distance will be in km.
Hope this helps you get closer to solving your problem!
P.S. - you may need to install haversine, as it is not in the standard library.
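To compute this per ticket rather than for a single pair, one possible sketch; the cities.csv column names id, lat, and lon are assumptions about your file:

import pandas as pd
from haversine import haversine

coords = cities[['id', 'lat', 'lon']]  # hypothetical column names

# attach origin and destination coordinates to each ticket
tickets = (
    ticket_data
    .merge(coords.rename(columns={'id': 'o_city', 'lat': 'o_lat', 'lon': 'o_lon'}), on='o_city')
    .merge(coords.rename(columns={'id': 'd_city', 'lat': 'd_lat', 'lon': 'd_lon'}), on='d_city')
)

# haversine distance in km for each ticket
tickets['distance_km'] = tickets.apply(
    lambda r: haversine((r['o_lat'], r['o_lon']), (r['d_lat'], r['d_lon'])),
    axis=1,
)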
I want to get the geometries of the districts bordering a given district.
districts
d0 = districts[0]
gpd.sjoin(d0, districts, op='intersects')
This gives the geometry of d0 in each row. But I want the geometry of the right table in each row. Is it possible to get both left and right table geometries?
You could use join to get the geometry from the right table after your sjoin:
gdf = gpd.sjoin(d0, districts, op='intersects')
gdf will have a column/series called index_right which we can leverage:
gdf.join(districts['geometry'], on='index_right', lsuffix='', rsuffix='_districts')
I'm not sure how geopandas will handle two geometry columns; I'm guessing all operations will use the original one from d0.
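For what it's worth, geopandas always operates on a single active geometry column, and you can choose which one with set_geometry — a small sketch continuing from the join above:

# the joined frame now has two geometry columns:
# 'geometry' (from d0) and 'geometry_districts' (from the right table)
joined = gdf.join(districts['geometry'], on='index_right',
                  lsuffix='', rsuffix='_districts')

# make the neighbouring districts' geometry the active one
joined = joined.set_geometry('geometry_districts')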