geopandas sjoin returning empty rows - python

I have a table of polygons of all UK output areas structured as such:
newpoly
OBJECTID OA11CD LAD11CD Shape__Are Shape__Len TCITY15NM geometry
67519 67520 E00069658 E06000018 3.396296e+04 1006.464423 Nottingham POLYGON ((456069.067 340766.874, 456057.000 34...
67520 67521 E00069659 E06000018 1.014138e+05 1404.327776 Nottingham POLYGON ((456691.549 340778.104, 456557.864 34...
67521 67522 E00069660 E06000018 1.812783e+04 731.882609 Nottingham POLYGON ((456945.994 340821.233, 456969.220 34...
67522 67523 E00069661 E06000018 2.765546e+04 1112.317587 Nottingham POLYGON ((456527.178 340669.119, 456484.993 34...
67523 67524 E00069662 E06000018 3.647822e+04 964.989153 Nottingham POLYGON ((456301.845 340419.759, 456244.357 34...
and a table of points structured like:
restaurants
name latitude longitude geometry
0 Restaurant Sat Bains with rooms 52.925050 -1.167712 POINT (-1.16771 52.92505)
1 Revolution Hockley 52.954090 -1.144025 POINT (-1.14403 52.95409)
2 Revolution Cornerhouse 52.955517 -1.150088 POINT (-1.15009 52.95552)
but when I do:
spatial_join = gpd.sjoin(restaurants, newpoly, op = 'contains')
spatial_join
0 rows match.
The geometry column of restaurants was made via:
restaurants = pd.read_csv('Restaurants_clean.csv')
restaurants = gpd.GeoDataFrame(
    restaurants,
    geometry=gpd.points_from_xy(restaurants.longitude, restaurants.latitude))
I have tried different 'op' arguments but the same problem occurs. I am convinced that there must be a match, because all UK output areas exist in the table.
Am I missing something?

You are using different projections; GeoPandas sjoin should actually warn you about that. Create your point layer in the following way:
restaurants = pd.read_csv('Restaurants_clean.csv')
restaurants = gpd.GeoDataFrame(
    restaurants,
    geometry=gpd.points_from_xy(restaurants.longitude, restaurants.latitude),
    crs=4326)
restaurants = restaurants.to_crs(newpoly.crs)
I first specify the CRS of the input (4326, the EPSG code of WGS84, i.e. lon/lat coordinates) and then re-project the data to the same CRS as newpoly (presumably EPSG:27700, the British National Grid).
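Putting it together, a minimal self-contained sketch, with a single made-up polygon standing in for newpoly, and predicate="within" since the points are on the left of the join (a point can never "contain" a polygon):

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical stand-in for newpoly: one polygon in British National Grid
# (EPSG:27700) covering the whole BNG extent, purely for illustration.
newpoly = gpd.GeoDataFrame(
    {"OA11CD": ["E00069658"]},
    geometry=[Polygon([(0, 0), (700000, 0), (700000, 1300000), (0, 1300000)])],
    crs=27700,
)

# Points are built in lon/lat (EPSG:4326) and re-projected before the join.
restaurants = gpd.GeoDataFrame(
    {"name": ["Restaurant Sat Bains with rooms"]},
    geometry=gpd.points_from_xy([-1.167712], [52.925050]),
    crs=4326,
).to_crs(newpoly.crs)

# Points on the left means the predicate is "within", not "contains".
spatial_join = gpd.sjoin(restaurants, newpoly, predicate="within")
print(len(spatial_join))  # 1 once both layers share a CRS
```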


Geopandas buffer and intersect

I am using the geojson file from the OpenData Vancouver website and I am trying to find the zoning classifications that fall within 5 km of a "Historical Area".
So, I am buffering all historical areas by 5 km (my data is projected), performing the intersect operation and using the intersect results as an index:
buffs = gdf_26910[gdf_26910['zoning_classification']=='Historical Area']['geometry'].buffer(5000)
gdf_26910[buffs.intersects(gdf_26910['geometry'])]
However, this is the output I am getting:
zoning_category zoning_classification zoning_district object_id geometry area centroid buffer5K
87 HA Historical Area HA-1A 78541 POLYGON ((492805.516 5458679.305, 492805.038 5... 3848.384041 POINT (492778.785 5458699.947) POLYGON ((497803.807 5458548.605, 497803.124 5...
111 HA Historical Area HA-3 78640 POLYGON ((491358.402 5458065.050, 491309.735 5... 66336.339719 POINT (491183.139 5458103.162) POLYGON ((492818.045 5453267.595, 492421.697 5...
180 HA Historical Area HA-1A 78836 POLYGON ((492925.194 5458575.204, 492929.600 5... 90566.768532 POINT (492753.969 5458456.804) POLYGON ((487583.872 5458086.263, 487564.746 5...
683 HA Historical Area HA-1 78779 POLYGON ((492925.194 5458575.204, 492802.702 5... 69052.427940 POINT (492606.372 5458621.753) POLYGON ((487874.100 5456398.633, 487789.801 5...
1208 HA Historical Area HA-2 78833 POLYGON ((492332.139 5458699.308, 492346.989 5... 179805.027166 POINT (492343.437 5458944.412) POLYGON ((489822.136 5454379.453, 489755.087 5...
Clearly, I am getting a match for the Historical Areas and not all the other geometries that intersect the buffers.
I have plotted the buffers and the output looks correct:
#Plot
base=gdf_26910.plot()
buffs.plot(ax=base, color='red', alpha=0.25)
I have also opened the data in QGIS and verified that there are 5 'Historical Areas' and they are all adjacent to 'Comprehensive Development'. So, the matching rows after the intersect operation should be "Comprehensive Development" at the least.
Where am I going wrong?
Two core points:
- You need to work in meters for a 5 km buffer, hence estimate_utm_crs() is used for the projection. cap_style and join_style are also used for a more representative buffered polygon.
- sjoin() is used instead of the mask approach in your code. This will produce duplicates, so de-dupe with pandas groupby().first().
UPDATE: changed to predicate="within" and used folium to visualise (which may help you understand how the geometry is working).
import geopandas as gpd
import folium
gdf_26910 = gpd.read_file(
"https://opendata.vancouver.ca/explore/dataset/zoning-districts-and-labels/download/?format=geojson&timezone=Europe/London&lang=en"
)
buffs = gdf_26910.loc[gdf_26910["zoning_classification"] == "Historical Area"]
# buffer is defined as km, so need a CRS in meters...
buffs = (
buffs.to_crs(buffs.estimate_utm_crs())
.buffer(5000, cap_style=2, join_style=3)
.to_crs(gdf_26910.crs)
)
# this warns so is clearly bad !
# gdf_26910[buffs.intersects(gdf_26910['geometry'])]
# some geometries intersect multiple historical areas, take first intersection from sjoin()
gdf_5km = (
gdf_26910.reset_index()
.sjoin(buffs.to_frame(), predicate="within")
.groupby("index")
.first()
.set_crs(gdf_26910.crs)
)
m = buffs.explore(name="buffer")
gdf_5km.explore("zoning_classification", m=m, name="within")
gdf_26910.explore("zoning_classification", m=m, name="all", legend=False)
folium.LayerControl().add_to(m)
m
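As an aside on why the original mask returned only the Historical Areas themselves: GeoSeries.intersects aligns the two series on their index, so each buffer is compared only with the geometry at the same index label, not with every row. A tiny sketch with made-up indices:

```python
import geopandas as gpd
from shapely.geometry import Point

# Two series with partially overlapping indices: intersects() compares
# label 5 with label 5; label 6 has no partner in `a`, so it comes out
# False even though that point sits inside a's circle spatially.
a = gpd.GeoSeries([Point(0, 0).buffer(1)], index=[5])
b = gpd.GeoSeries([Point(0, 0), Point(0.5, 0.5)], index=[5, 6])

aligned = a.intersects(b)
print(aligned)  # label 5 -> True, label 6 -> False
```

This is exactly the pattern in the question: the Historical Area rows intersect their own buffers at the same index labels, and every other row is compared against nothing, so only Historical Areas survive the mask.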

nearest neighbour algorithm implementation for 25k store addresses to be matched with 76k points in python

I'm trying to find a solution to the nearest neighbour problem.
I have a list of stores (about 25000) with coordinates, and a list of hexagons belonging to the different cities I cover with deliveries, each described by centroid coordinates and a hexagon polygon.
I need to tell which hexagon each address belongs to.
I can find the brute-force solution, but it needs about 2 days of calculation, and I am interested in a fast solution I can run frequently if city coverage has to change.
data of stores:
INDIRIZZO_COMPLETO latitude longitude COORDINATE_EXTRACTION_DETAIL
0 LUNGOMARE LUIGI RIZZO 1, 92010 LAMPEDUSA E LIN... 35.497965 12.607482 from original address
1 VIA TERRANOVA 71, 92010 LAMPEDUSA E LINOSA (AG... 35.506421 12.610504 from original address
2 VIALE PAPA PIO XII 107/109, 00036 PALESTRINA (... 35.551062 12.320357 from zipcode: 36 Italy
3 VIA ROMA 82, 96010 PORTOPALO DI CAPO PASSERO (... 36.682967 15.133651 from original address
4 CONTRADA PIANETTI SNC, 96018 PACHINO (SR), SIC... 36.700497 15.073600 from zipcode: 96018 Italy
data of hexagons:
city_code Polygon latitude longitude
0 SCN POLYGON ((10.63303663611384 44.59771368472511,... 44.597003 10.635361
1 SCN POLYGON ((10.706225086720105 44.58751732975397... 44.586805 10.708550
2 BAR POLYGON ((16.939176495419776 41.09659615583256... 41.095711 16.941403
3 BAR POLYGON ((16.925717571722554 41.10755391076213... 41.106669 16.927944
4 BAR POLYGON ((16.89992580762363 41.067339007464646... 41.066454 16.902151
I implemented a solution using this:
import numpy as np
from sklearn.neighbors import BallTree

tree = BallTree(np.deg2rad(df[['latitude', 'longitude']].values), metric='haversine')
distances, indices = tree.query(np.deg2rad(np.c_[query_lats, query_lons]), k=5)
r_km = 6371  # multiplier to convert unit distances to km
for name, d, ind in zip(df_other['INDIRIZZO_COMPLETO'], distances, indices):
    print(f"INDIRIZZO_COMPLETO {name} closest matches:")
    for i, index in enumerate(ind):
        print(f"\t{df['city_code'][index]} with distance {d[i]*r_km:.4f} km")
        list_data = (name, df['city_code'][index], d[i]*r_km)
        append_list_as_row(file_name_2, list_data)
with quite good results for some zones and totally wrong ones for a lot of others.
Any suggestions?
Use geopandas.sjoin to efficiently assign points to polygons:
import geopandas

gdf = geopandas.GeoDataFrame(
    df,
    geometry=geopandas.points_from_xy(df.longitude, df.latitude),
)
joined = geopandas.sjoin(
    gdf,
    other_df,
    how="left",
    predicate="intersects",
)
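A runnable end-to-end sketch of that approach, assuming (as in the sample data) that the hexagons arrive as WKT strings which first need parsing with shapely, and using tiny made-up rows standing in for the real tables:

```python
import geopandas
import pandas as pd
from shapely import wkt

# Hypothetical miniature versions of the two tables
df = pd.DataFrame({
    "INDIRIZZO_COMPLETO": ["VIA ROMA 82, 96010 PORTOPALO"],
    "latitude": [41.10],
    "longitude": [16.94],
})
hexagons = pd.DataFrame({
    "city_code": ["BAR"],
    "Polygon": ["POLYGON ((16.9 41.05, 17.0 41.05, 17.0 41.15, 16.9 41.15, 16.9 41.05))"],
})

gdf = geopandas.GeoDataFrame(
    df,
    geometry=geopandas.points_from_xy(df.longitude, df.latitude),
    crs=4326,
)
# Parse the WKT strings into shapely geometries before building the frame
other_df = geopandas.GeoDataFrame(
    hexagons[["city_code"]],
    geometry=hexagons["Polygon"].apply(wkt.loads),
    crs=4326,
)

joined = geopandas.sjoin(gdf, other_df, how="left", predicate="intersects")
print(joined["city_code"].iloc[0])  # BAR
```

The sjoin builds a spatial index internally, which is what turns the two-day brute-force scan into something that finishes in seconds at 25k x 76k scale.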

Which multipolygon does a point of given longitude and latitude belong to in Python?

I have the longitude and latitude and the expected result is that whichever multipolygon the point is in, I get the name or ID of the multipolygon.
import geopandas as gpd
world = gpd.read_file('/Control_Areas.shp')
world.plot()
Output
0 MULTIPOLYGON (((-9837042.000 6137048.000, -983...
1 MULTIPOLYGON (((-11583146.000 5695095.000, -11...
2 MULTIPOLYGON (((-8542840.287 4154568.013, -854...
3 MULTIPOLYGON (((-10822667.912 2996855.452, -10...
4 MULTIPOLYGON (((-13050304.061 3865631.027, -13.
Previous attempts:
I have tried fiona, shapely and geopandas but I have struggled horribly to make progress. The closest I have gotten is the within and contains functions, but what I have struggled with is successfully transforming the multipolygons to polygons and then using the power of within and contains to get the desired output.
The shapefile has been downloaded from here.
world.crs gives {'init': 'epsg:3857'} (Web Mercator projection) so you should first reproject your GeoDataFrame in the WGS84 projection if you want to keep the latitude-longitude coordinate system of your point.
world = world.to_crs("EPSG:4326")
Then you can use the intersects method of GeoPandas to find the indexes of the Polygons that contain your point.
For example for the city of New York:
from shapely.geometry import Point
NY_pnt = Point(-74.005941, 40.712784)  # shapely points are (x, y), i.e. (lon, lat)
world[["ID","NAME"]][world.intersects(NY_pnt)]
which results in:
ID NAME
20 13501 NEW YORK INDEPENDENT SYSTEM OPERATOR
You can check the result with the shapely within method:
NY_pnt.within(world["geometry"][20])
If you have multiple points, you can create a GeoDataFrame and use the sjoin method:
NY_pnt = Point(-74.005941, 40.712784)
LA_pnt = Point(-118.243683, 34.052235)
points_df = gpd.GeoDataFrame({'geometry': [NY_pnt, LA_pnt]}, crs='EPSG:4326')
results = gpd.sjoin(points_df, world, predicate='within')
results[['ID', 'NAME']]
Output:
ID NAME
0 13501 NEW YORK INDEPENDENT SYSTEM OPERATOR
1 11208 LOS ANGELES DEPARTMENT OF WATER AND POWER

How to extract LineString data from GeoDataFrame and match with Polygon?

Dear StackOverflow community,
Main question:
I have a GeoDataFrame containing the StreetNetwork of NYC - Manhattan (obtained through the osmnx package), where I would like to extract the coordinates (lon/lat data) from all streets which are stored as LineStrings under geometry, like so:
field_1 0
access
bridge
geometry LINESTRING (-73.9975944 40.7140611, -73.997492...
highway residential
junction
key 0
lanes
length 11.237
maxspeed 25 mph
name Catherine Street
oneway True
osmid 5670536
ref
service
tunnel
u 1773060097
v 42437559
width
geometry LINESTRING (-73.9975944 40.7140611, -73.997492...
Name: 0, dtype: object
What I am trying to do is to extract the geometry information for each line-item:
df.iloc[x][3]
The issue is that the output format is as str:
[1]: LINESTRING (-73.9975944 40.7140611, -73.9974922 40.7139962)
[2]: print(type(...))
[2]: <class 'str'>
This makes it hard to automate and process the output data. Does anyone know how to extract it so that it is already in LineString (or any other usable list/array) format?
Further question:
My overall goal is to match this street information with a shapefile of taxi zones (polygons) to identify which streets are in which zone, and which lon/lat areas within a zone are covered by streets. Is there any straightforward way to do this leveraging the shapely, geopandas or osmnx packages (i.e. something like "polygon.contains(Point)", but in the sense of "polygon.contains(LineString)")?
Thanks a lot for your support!
To extract the coordinates of each geometry, apply over the geometry column (note that Series.apply takes no axis argument, and coords needs to be materialised into a list):
df.geometry.apply(lambda geom: list(geom.coords))
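For the follow-up question, the same sjoin machinery that matches points to polygons also works for LineStrings against polygons. A minimal sketch with a made-up zone (the zone name here is purely illustrative, not from the taxi-zone shapefile):

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon

streets = gpd.GeoDataFrame(
    {"name": ["Catherine Street"]},
    geometry=[LineString([(-73.9975944, 40.7140611), (-73.9974922, 40.7139962)])],
    crs=4326,
)
zones = gpd.GeoDataFrame(
    {"zone": ["Zone A"]},  # hypothetical taxi zone
    geometry=[Polygon([(-74.00, 40.71), (-73.99, 40.71),
                       (-73.99, 40.72), (-74.00, 40.72)])],
    crs=4326,
)

# "within" keeps a street only if the whole LineString lies inside the zone;
# "intersects" would also keep streets that merely cross the boundary.
matched = gpd.sjoin(streets, zones, predicate="within")
print(matched[["name", "zone"]])
```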

How to judge whether or not a given coordinate is in a certain city?

I have some origin data that contains lat-lng coordinates, and when I use osmnx's get_nearest_edges method, I want to filter out those coordinates that are not in the given city (San Francisco in this example). Is there any convenient method that implements this feature?
Here is part of my code:
roadId = ox.utils.get_nearest_edges(G, df['longitude'], df['latitude'], method='balltree')
df['startId'] = roadId[:,0]
df['endId'] = roadId[:,1]
startId = roadId[:,0]
endId = roadId[:,1]
gdf_nodes, gdf_edges = ox.graph_to_gdfs(G)
startInfo = gdf_nodes.loc[startId]
endInfo = gdf_nodes.loc[endId]
df['startLat'] = startInfo.loc[:, ['y']].values
df['startLon'] = startInfo.loc[:, ['x']].values
df['endLat'] = endInfo.loc[:, ['y']].values
df['endLon'] = endInfo.loc[:, ['x']].values
The first line's G is from this:
G = ox.graph_from_place('San Francisco, California, USA', network_type='drive')
And the output file is like this:
latitude 37.61549
longitude -122.38821
startId 65365765
endId 65365766
startLat 37.708957
startLon -122.392803
endLat 37.708785
endLon -122.393012
This example shows the problem: the road in the result is not in San Francisco. How can I identify such cases in the code and remove them?
You asked two questions. First, how to determine if a pair of lat-lng coordinates is within a city's boundary? Second, how to get the bounding box of a city? Here is how to do both with OSMnx (and shapely, which OSMnx is built on top of):
import osmnx as ox
from shapely.geometry import Point
gdf = ox.gdf_from_place('Piedmont, CA, USA')
geom = gdf.loc[0, 'geometry']
# get the bounding box of the city
geom.bounds
# determine if a point is within the city boundary
coords = (-122.24, 37.82)
geom.intersects(Point(coords))
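To filter a whole DataFrame of coordinates at once rather than testing points one by one, the same idea vectorises with geopandas. A sketch, where the boundary is a made-up rectangle standing in for the polygon OSMnx returns:

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical box standing in for the San Francisco boundary polygon
boundary = Polygon([(-122.52, 37.70), (-122.35, 37.70),
                    (-122.35, 37.83), (-122.52, 37.83)])

df = gpd.GeoDataFrame(
    {"latitude": [37.61549, 37.7749], "longitude": [-122.38821, -122.4194]},
    geometry=gpd.points_from_xy([-122.38821, -122.4194], [37.61549, 37.7749]),
    crs=4326,
)

# Keep only the rows whose point falls inside the boundary
inside = df[df.within(boundary)]
print(len(inside))  # 1: the first point (lat 37.615) is south of the box
```

Rows dropped here are exactly the ones whose nearest edge would otherwise come back outside the city.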
