I need to filter the rows of a dataframe that fall within a multipolygon. My multipolygon is stored in gdf_polygon and my points are stored in gdf. Here is a brief summary of how they look.
gdf_polygon
id geometry
0 MULTIPOLYGON (((39.81239 21.43429, 39.81445 21...
gdf
id geometry
0 POINT (50.05832 26.43992)
... ...
The problem is that when I check whether any points are inside it, it returns False, even though I know some points are inside the polygon.
Basically, if I run this I get False as output.
gdf_polygon.geometry.contains(gdf.geometry).any()
However, if I run this I get True as output, because that point is inside the polygon.
gdf_polygon.geometry.contains(gdf.geometry[141828])
I know I could iterate through all the rows of gdf and run the contains for each one, but since my dataset is quite big (around 30.000.000 rows) that would be very inefficient. So I was looking for an explanation or possible fixes.
My dataframes are created like this:
crs = {'init': 'epsg:4326'}
df = pd.read_csv(FOLDER+file, compression='gzip', escapechar='\\')
geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]
gdf = gpd.GeoDataFrame(df,crs=crs, geometry=geometry)
inside = gdf.geometry.within(gdf_polygon.geometry)
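When you compare two GeoSeries with contains, geopandas aligns them on their index and compares the geometries row by row (polygon 0 against point 0, and so on) rather than testing every point against the polygon; see https://gis.stackexchange.com/questions/345785/geopandas-intersect-function-gives-different-result-to-shapely/345822#345822 for an explanation.
To make your code work as intended, compare your GeoSeries of points with the multipolygon geometry itself rather than with the GeoSeries that holds it. Also turn the test around and use within.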
polygon = gdf_polygon.geometry.iloc[0]
gdf.geometry.within(polygon)
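To make the difference concrete, here is a minimal sketch on toy data. The square polygon and the three points are made up for illustration; only contains, within and iloc come from the code above.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical toy data: one square polygon and three points.
polygons = gpd.GeoSeries([Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])])
points = gpd.GeoSeries([Point(5, 5), Point(1, 1), Point(3, 3)])

# Series against series: rows are paired by index label, so the polygon
# (index 0) is only compared with Point(5, 5) (index 0). The other points
# have no polygon to be paired with.
aligned = polygons.contains(points)

# Series against a single geometry: every point is tested against the polygon.
polygon = polygons.iloc[0]
inside = points.within(polygon)
points_inside = points[inside]
```

Here aligned.any() comes out False even though Point(1, 1) lies inside the square, while inside flags that point and points_inside keeps only it.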
Related
I have a function that assigns an id if a point is within a polygon, but it is classifying the same shapely point incorrectly. It runs over two DataFrames: poly, which contains the polygons in shapely format (I looked at the polygons and they look correct), and df, which contains the start_point in shapely format. When I run the code I get inconsistent results. The dataset I am using is big, over 2 million rows. None of the misclassified points are on the boundary of a polygon.
def inside_polygon(df, polygons):
    result = np.zeros((len(df), 2), dtype=object)
    for polygon in polygons[["fence_id","polygon","name"]].itertuples():
        inside = np.array([point.within(polygon.polygon) for point in df["start_point"]])
        result[inside, 0] = polygon.fence_id
        result[inside, 1] = polygon.name
    return pd.DataFrame(result, columns=["fence_id", "name"])
df.loc[:,'start_point'] = df.apply(lambda row: Point(row['start_long'], row['start_lat']), axis=1)
df["fence_id"] = None
df["name"] = None
df.loc[:, ['fence_id','name']] = inside_polygon(df, poly)
Same point, different classification (the point is actually outside the polygon): https://i.stack.imgur.com/VdPzi.png
Can someone help?
I tried both the "within" and "contains" functions, with the same results for both. Maybe the issue is in how I link the fence_id from the 'poly' DataFrame to the 'df' DataFrame that contains the points.
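One thing that may be worth checking on the linking side (this is an assumption, the question does not confirm it): assigning a DataFrame through .loc aligns on index labels, and inside_polygon returns a frame with a fresh 0..n-1 index. If df has any other index, for example after filtering or sorting, the results land on the wrong rows. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical frame whose index is not 0, 1, 2 in order (e.g. after sorting).
df = pd.DataFrame({"start_long": [1.0, 2.0, 3.0]}, index=[2, 0, 1])
df["fence_id"] = None
df["name"] = None

# Stand-in for the output of inside_polygon: row i describes the i-th row of
# df by position, but it carries a default 0..n-1 index.
result = pd.DataFrame({"fence_id": ["a", "b", "c"], "name": ["A", "B", "C"]})

# Label-based assignment: pandas matches index labels, so the rows are shuffled.
aligned = df.copy()
aligned.loc[:, ["fence_id", "name"]] = result

# Positional assignment: passing a plain array skips the alignment.
positional = df.copy()
positional.loc[:, ["fence_id", "name"]] = result.to_numpy()
```

If df.index is already 0..n-1 in order, both assignments give the same answer and the cause lies elsewhere.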
I have a geopandas dataframe consisting of a combination of LineStrings and MultiLineStrings. I would like to select those LineStrings and MultiLineStrings containing a point within a box (defined by me) of latitude longitude, for which I don't have a geometry. In other words, I have some mapped USGS fault traces and I would like to pick a square inset of those fault lines within a certain distance from some lat/lons. So far I've had some success unwrapping just coordinates from the entire data frame and only saving points that fall within a box of lat/lon, but then I no longer keep the original geometry or information saved in the data frame. (i.e. like this:)
xvals=[]
yvals=[]
for flt in qfaults['geometry']:
    for coord in flt.coords:
        if coord[1] >= centroid[1]-1 and coord[1] <= centroid[1]+1 and coord[0]<=centroid[0]+1 and coord[0]>=centroid[0]-1:
            xvals.append(coord[0])
            yvals.append(coord[1])
Is there any intuition as to how to do this using the GeoPandas data frame? Thanks in advance.
GeoPandas has a .cx indexer that does exactly this: it returns the geometries that intersect a bounding box. See https://geopandas.readthedocs.io/en/latest/docs/user_guide/indexing.html
Syntax is gdf.cx[xmin:xmax, ymin:ymax]
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
southern_world = world.cx[:, :0]
western_world = world.cx[:0, :]
western_europe = world.cx[1:10, 40:60]
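Applied to the fault traces from the question, a minimal sketch could look like this. The three geometries and the name column are made up; qfaults and centroid are the names used in the question.

```python
import geopandas as gpd
from shapely.geometry import LineString, MultiLineString

# Hypothetical stand-in for the USGS fault traces.
qfaults = gpd.GeoDataFrame(
    {"name": ["near", "far", "partly_near"]},
    geometry=[
        LineString([(0, 0), (1, 1)]),
        LineString([(10, 10), (11, 11)]),
        MultiLineString([[(0.5, 0.5), (5, 5)], [(20, 20), (21, 21)]]),
    ],
)

centroid = (0.5, 0.5)  # (lon, lat) of the box centre
half_width = 1  # box half-width in degrees, as in the question's loop

# Keep every row whose geometry intersects the box; the original geometry
# and all attribute columns are preserved.
subset = qfaults.cx[
    centroid[0] - half_width : centroid[0] + half_width,
    centroid[1] - half_width : centroid[1] + half_width,
]
```

Note that .cx tests whether a geometry intersects the box, so a line that crosses the box without having a vertex inside it is also selected, which is slightly broader than the coordinate loop in the question.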
I am new to Python, so I apologize for my rudimentary programming skills. I am aware that I am using too many for loops (coming from Matlab, it is dragging me down).
I have millions of points (timestep, long, lat, pointID) and hundreds of irregular non-overlapping polygons (vertex_long, vertex_lat, polygonID). (Image: points and polygons format sample.)
I want to know what polygon contains each point.
I was able to do it this way:
from matplotlib import path
def inpolygon(lon_point, lat_point, lon_poly, lat_poly):
    shape = lon_point.shape
    lon_point = lon_point.reshape(-1)
    lat_point = lat_point.reshape(-1)
    lon_poly = lon_poly.values.reshape(-1)
    lat_poly = lat_poly.values.reshape(-1)
    points = [(lon_point[i], lat_point[i]) for i in range(lon_point.shape[0])]
    polys = path.Path([(lon_poly[i], lat_poly[i]) for i in range(lon_poly.shape[0])])
    return polys.contains_points(points).reshape(shape)
And then
import numpy as np
import pandas as pd
Areas_Lon = Areas.iloc[:,0]
Areas_Lat = Areas.iloc[:,1]
Areas_ID = Areas.iloc[:,2]
Unique_Areas = np.unique(Areas_ID)
Areas_true=np.zeros((Areas_ID.shape[0],Unique_Areas.shape[0]))
for i in range(Areas_ID.shape[0]):
    for ii in range(Unique_Areas.shape[0]):
        Areas_true[i,ii]=(Areas_ID[i]==Unique_Areas[ii])
Areas_Lon_Vertex=np.zeros(Unique_Areas.shape[0],dtype=object)
Areas_Lat_Vertex=np.zeros(Unique_Areas.shape[0],dtype=object)
for i in range(Unique_Areas.shape[0]):
    Areas_Lon_Vertex[i]=(Areas_Lon[(Areas_true[:,i]==1)])
    Areas_Lat_Vertex[i]=(Areas_Lat[(Areas_true[:,i]==1)])
import f_inpolygon as inpolygon
Areas_in=np.zeros((Unique_Areas.shape[0],Points.shape[0]))
for i in range (Unique_Areas.shape[0]):
    for ii in range (Points.shape[0]):
        Areas_in[i,ii]=(inpolygon.inpolygon(Points[ii,2], Points[ii,3], Areas_Lon_Vertex[i], Areas_Lat_Vertex[i]))
This way the final outcome, Areas_in, contains as many rows as polygons and as many columns as points, where every column is true (1) at the row of the polygon that contains the point (1st given polygon ID --> 1st row, and so on). (Image: Areas_in format.)
The code works, but very slowly for what it is supposed to do. When locating points in a regular grid or within a radius of a point, I have successfully used a KD-tree, which increases the speed dramatically, but I can't do the same, or anything faster, for irregular non-overlapping polygons.
I have seen some related questions, but they asked whether a point is inside a polygon or not, rather than which polygon a point is in.
Any idea please?
Have you tried a GeoPandas spatial join?
Install the package using pip
pip install geopandas
or conda
conda install -c conda-forge geopandas
Then you should be able to read the data as a GeoDataFrame
import geopandas
from shapely.geometry import Point
df = geopandas.read_file("file_name1.csv") # you can read shp files too.
right_df = geopandas.read_file("file_name2.csv") # you can read shp files too.
# Build a point geometry for each row; float() guards against coordinates read in as text
geometry = [Point(float(x), float(y)) for x, y in zip(df['longitude'], df['latitude'])]
# Coordinate reference system: WGS84
crs = 'EPSG:4326'
# Creating a Geographic data frame
left_df = geopandas.GeoDataFrame(df, crs=crs, geometry=geometry)
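Then you can apply sjoin
jdf = geopandas.sjoin(left_df, right_df, how='inner', predicate='intersects', lsuffix='left', rsuffix='right')
The options for predicate (named op in GeoPandas versions before 0.10) include:
intersects
contains
within
With points in left_df and polygons in right_df, intersects and within should give the same result, except for points lying exactly on a polygon boundary. contains tests the opposite direction, so it would only match if the polygons were the left frame.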
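As a small end-to-end sketch of the idea, with made-up points and two made-up square polygons (pointID and polygonID follow the column names in the question). It relies on sjoin's default predicate, intersects, so it does not depend on the name of the keyword:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical points: two fall inside a polygon, one falls outside all of them.
points = gpd.GeoDataFrame(
    {"pointID": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(9, 9)],
)

# Hypothetical non-overlapping polygons.
polygons = gpd.GeoDataFrame(
    {"polygonID": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
    ],
)

# how="left" keeps every point; points outside all polygons get a missing polygonID.
joined = gpd.sjoin(points, polygons, how="left")

# One polygonID per point, looked up by pointID.
by_point = joined.set_index("pointID")["polygonID"]
```

The join uses a spatial index internally, so it avoids the polygon-by-point double loop from the question.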
In the geopandas documentation it says that
A GeoDataFrame may also contain other columns with geometrical (shapely) objects, but only one column can be the active geometry at a time. To change which column is the active geometry column, use the set_geometry method.
I'm wondering how to use such a GeoDataFrame if the goal is to flexibly reproject the geometrical data in these various columns to one or more other coordinate reference systems. Here's what I tried.
First try
import geopandas as gpd
from shapely.geometry import Point
crs_lonlat = 'epsg:4326' #geometries entered in this crs (lon, lat in degrees)
crs_new = 'epsg:3395' #geometries needed in (among others) this crs
gdf = gpd.GeoDataFrame(crs=crs_lonlat)
gdf['geom1'] = [Point(9,53), Point(9,54)]
gdf['geom2'] = [Point(8,63), Point(8,64)]
#Working: setting geometry and reprojecting for first time.
gdf = gdf.set_geometry('geom1')
gdf = gdf.to_crs(crs_new) #geom1 is reprojected to crs_new, geom2 still in crs_lonlat
gdf
Out:
geom1 geom2
0 POINT (1001875.417 6948849.385) POINT (8 63)
1 POINT (1001875.417 7135562.568) POINT (8 64)
gdf.crs
Out: 'epsg:3395'
So far, so good. Things go off the rails if I want to set geom2 as the geometry column, and reproject that one as well:
#Not working: setting geometry and reprojecting for second time.
gdf = gdf.set_geometry('geom2') #still in crs_lonlat...
gdf.crs #...but this still says crs_new...
Out: 'epsg:3395'
gdf = gdf.to_crs(crs_new) #...so this doesn't do anything! (geom2 unchanged)
gdf
Out:
geom1 geom2
0 POINT (1001875.417 6948849.385) POINT (8.00000 63.00000)
1 POINT (1001875.417 7135562.568) POINT (8.00000 64.00000)
Ok, so, apparently, the .crs attribute of the gdf is not reset to its original value when changing the column that serves as the geometry - it seems, the crs is not stored for the individual columns. If that is the case, the only way I see to use reprojection with this dataframe, is to backtrack: start --> select column as geometry --> reproject gdf to crs_new --> use/visualize/... --> reproject gdf back to crs_lonlat --> goto start. This is not usable if I want to visualise both columns in one figure.
Second try
My second attempt was, to store the crs with each column separately, by changing the corresponding lines in the script above to:
gdf = gpd.GeoDataFrame()
gdf['geom1'] = gpd.GeoSeries([Point(9,53), Point(9,54)], crs=crs_lonlat)
gdf['geom2'] = gpd.GeoSeries([Point(8,63), Point(8,64)], crs=crs_lonlat)
However, it soon became clear that, though initialised as a GeoSeries, these columns are normal pandas Series, and don't have a .crs attribute the same way GeoSeries do:
gdf['geom1'].crs
AttributeError: 'Series' object has no attribute 'crs'
s = gpd.GeoSeries([Point(9,53), Point(9,54)], crs=crs_lonlat)
s.crs
Out: 'epsg:4326'
Is there something I'm missing here?
Is the only solution, to decide on the 'final' crs beforehand - and do all the reprojecting before adding the columns? Like so...
gdf = gpd.GeoDataFrame(crs=crs_new)
gdf['geom1'] = gpd.GeoSeries([Point(9,53), Point(9,54)], crs=crs_lonlat).to_crs(crs_new)
gdf['geom2'] = gpd.GeoSeries([Point(8,63), Point(8,64)], crs=crs_lonlat).to_crs(crs_new)
#no more reprojecting done/necessary/possible! :/
...and then, when another crs is needed, rebuild the entire gdf from scratch? That can't be the way this was intended to be used.
Unfortunately, the desired behaviour is currently not possible. Due to limitations in the package, geopandas does not accommodate this use case at the moment, as can be seen in this issue in the github repo.
My workaround is to not use a GeoDataFrame at all, but rather to combine a normal pandas DataFrame, for the non-shapely data, with several separate geopandas GeoSeries, for the shapely geometry data. Each GeoSeries has its own crs and can be correctly reprojected whenever necessary.
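A minimal sketch of that workaround, reusing the points and CRS codes from the question. The attributes frame and its label column are made up for illustration:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

crs_lonlat = "EPSG:4326"  # geometries entered in this crs (lon, lat in degrees)
crs_new = "EPSG:3395"  # geometries needed in (among others) this crs

# Non-geometric data lives in a plain DataFrame (hypothetical column).
attributes = pd.DataFrame({"label": ["first", "second"]})

# Each geometry "column" is its own GeoSeries and carries its own crs.
geom1 = gpd.GeoSeries([Point(9, 53), Point(9, 54)], crs=crs_lonlat)
geom2 = gpd.GeoSeries([Point(8, 63), Point(8, 64)], crs=crs_lonlat)

# Reproject each series independently; the originals stay in crs_lonlat.
geom1_new = geom1.to_crs(crs_new)
geom2_new = geom2.to_crs(crs_new)
```

All three objects share the same default index, so rows can still be matched up by position or by index label when plotting both sets of points in one figure.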
I am hoping to create a region on a map and be able to automatically determine if points (coordinates) are inside that region. I'm using a geojson file of the entire US and coordinates for New York City for this example.
Geojson: https://github.com/johan/world.geo.json
I have read the shapely documentation and just can't figure out why my results are returning False. Any help would be much appreciated.
import json
from shapely.geometry import shape, GeometryCollection, Point
with open('USA.geo.json', 'r') as f:
    js = json.load(f)
point = Point(40.712776, -74.005974)
for feature in js['features']:
    polygon = shape(feature['geometry'])
    if polygon.contains(point):
        print ('Found containing polygon:', feature)
I'm hoping to print the contained coordinates, but nothing is printed.
You need to swap the values of the Point() around:
point = Point(-74.005974, 40.712776)
The dataset you're using stores coordinates with the longitude first and the latitude second, and shapely's Point takes (x, y), so it has to be Point(longitude, latitude).
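You can check the effect without the full file. The sketch below uses a made-up rectangle around New York City in GeoJSON form instead of the real USA geometry:

```python
from shapely.geometry import Point, shape

# Hypothetical GeoJSON geometry: a rough box around New York City,
# written in GeoJSON order (longitude, latitude).
geometry = {
    "type": "Polygon",
    "coordinates": [[[-75, 40], [-73, 40], [-73, 41], [-75, 41], [-75, 40]]],
}
polygon = shape(geometry)

lat, lon = 40.712776, -74.005974

# Latitude first: the point ends up far away from the box.
wrong_order = polygon.contains(Point(lat, lon))

# Longitude first (x, y): the point is inside the box.
right_order = polygon.contains(Point(lon, lat))
```

wrong_order is False and right_order is True, which matches what happens with the full dataset.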