I have a function that assigns an id to a point if it lies within a polygon, but it classifies the same shapely point inconsistently. It runs over two DataFrames: poly, which contains the polygons in shapely format (I inspected the polygons and they look correct), and df, which contains start_point in shapely format. The dataset I am using is big, over 2 million rows. None of the misclassified points are on the boundary of a polygon.
def inside_polygon(df, polygons):
    result = np.zeros((len(df), 2), dtype=object)
    for polygon in polygons[["fence_id","polygon","name"]].itertuples():
        inside = np.array([point.within(polygon.polygon) for point in df["start_point"]])
        result[inside, 0] = polygon.fence_id
        result[inside, 1] = polygon.name
    return pd.DataFrame(result, columns=["fence_id", "name"])
df.loc[:,'start_point'] = df.apply(lambda row: Point(row['start_long'], row['start_lat']), axis=1)
df["fence_id"] = None
df["name"] = None
df.loc[:, ['fence_id','name']] = inside_polygon(df, poly)
Same point, different classification (the point is actually outside the polygon): https://i.stack.imgur.com/VdPzi.png
Can someone help?
I tried both the "within" and "contains" functions and got the same results with each. Maybe the issue is in how I link the fence_id from the 'poly' DataFrame to the 'df' DataFrame that contains the points.
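One thing worth checking, offered as an assumption because the question does not show how df is indexed: inside_polygon returns a DataFrame with a fresh 0..n-1 index, and assigning a DataFrame through df.loc aligns on index labels rather than on position. If df has been filtered, sorted or concatenated, the labels no longer match the positions and rows can receive another row's result. A minimal sketch that carries df's index through (the sample polygon and points are hypothetical):

```python
import numpy as np
import pandas as pd
from shapely.geometry import Point, Polygon


def inside_polygon(df, polygons):
    """Return fence_id/name for each point, indexed like df."""
    # None (rather than 0) marks points that fall in no polygon.
    result = np.full((len(df), 2), None, dtype=object)
    for polygon in polygons[["fence_id", "polygon", "name"]].itertuples():
        inside = np.array(
            [point.within(polygon.polygon) for point in df["start_point"]],
            dtype=bool,
        )
        result[inside, 0] = polygon.fence_id
        result[inside, 1] = polygon.name
    # Reuse df's index so a label-aligned join lines the rows up correctly.
    return pd.DataFrame(result, columns=["fence_id", "name"], index=df.index)


# Hypothetical data: one unit square, three points on a non-default index.
poly = pd.DataFrame({
    "fence_id": [7],
    "polygon": [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
    "name": ["unit_square"],
})
df = pd.DataFrame(
    {"start_point": [Point(0.5, 0.5), Point(2, 2), Point(0.1, 0.9)]},
    index=[10, 20, 30],
)
df = df.join(inside_polygon(df, poly))
```

If the index of df also contains duplicate labels, resetting it with df.reset_index(drop=True) before classifying is another option.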
Related
I'm trying to convert some points tabulated in a .csv file into a netcdf file.
This is my .csv file: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
In my spreadsheet, each point has a unique location. The points do not cover the whole area regularly, but they are spaced 0.1 degree apart, and each has one SP value per year for up to 100 years ahead.
To work with this data, I need it in the same form as other netcdf sources, tabulated as sp(time, lat, lon), so that I can evaluate and visualize the values for this specific region by year (using panoply or ncview, for example).
For that, I came up with this code:
import pandas as pd
import xarray as xr
import numpy as np
csv_file = 'example.csv'
df = pd.read_csv(csv_file)
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time']= pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
xr = df.to_xarray()
xc = xr.fillna(0)
xc.to_netcdf(csv_file + '.nc')
And I got a netcdf file like this: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
At first, my code seems to work and create my netcdf file without problems, however, I noticed that in some places I am creating some "leakage" of points, or interpolating the same values in some direction (north-south and west-east) when it shouldn't happen.
If you do a simple plot before converting to xarray you can see there are 3 west segments and one south segment
xr.sp[0].plot()
And this ends up being masked a bit when I fill the NaN with 0 and plot it again:
xc.sp[0].plot()
Checking the netcdf file using panoply I got something similar as well:
So I started checking every step of my code to see if I missed something. My first guess was the melt step, but I'm not 100% sure, because if I plot df I can't see any leaking or extrapolation in the same region:
import seaborn
import contextily
joint_axes = seaborn.jointplot(
    x="lon", y="lat", data=df, s=0.5
)
contextily.add_basemap(
    joint_axes.ax_joint,
    crs="EPSG:4326",
    source=contextily.providers.CartoDB.PositronNoLabels,
)
So does anyone have any idea what's happening here?
EDIT:
A solution that would help me for now would be to fill in the missing coordinates within my domain with a value of 0, using the minimum and maximum latitudes and longitudes.
My first (and unconventional) idea was to create a 0.1 x 0.1 grid with values equal to zero and feed this grid with my existing values.
However, reindex would let me do this in a few lines. My question is whether I should do it before or after the df.melt in my code.
This is what I have so far:
csv_file = '/Users/helioguerraneto/Desktop/example.csv'
df = pd.read_csv(csv_file)
lonmin, lonmax = df['lon'].min(), df['lon'].max()
latmin, latmax = df['lat'].min(), df['lat'].max()
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time']= pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
xr = df.to_xarray()
xc = xr.reindex(lat=np.arange(latmin, latmax, 0.1), lon=np.arange(lonmin, lonmax, 0.1), fill_value=0)
xc.to_netcdf(csv_file + '.nc')
reindex seems like the way to go, but I need to keep the original data. I was expecting some zeros, but not over the whole area:
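A likely reason for zeros everywhere, stated as an assumption since the data itself is not shown: np.arange builds its values with floating-point arithmetic, so many of them are not exactly equal to the coordinates parsed from the CSV, and reindex fills every label it cannot match exactly with fill_value. np.arange also excludes the stop value, so the maximum latitude and longitude are dropped. The numpy-only sketch below shows the mismatch and one way to build a rounded, inclusive axis (regular_axis is a hypothetical helper, not part of xarray):

```python
import numpy as np


def regular_axis(lo, hi, step=0.1, decimals=1):
    """Evenly spaced axis from lo to hi inclusive, rounded to `decimals`."""
    n = int(round((hi - lo) / step)) + 1
    return np.round(lo + step * np.arange(n), decimals)


raw = np.arange(0.0, 1.0, 0.1)    # unrounded, stop value excluded
clean = regular_axis(0.0, 1.0)    # rounded, stop value included

print(raw[3] == 0.3)     # does the raw value match the literal 0.3?
print(clean[3] == 0.3)   # does the rounded value match?
print(len(raw), len(clean))
```

For this to help, the lat and lon columns would need the same rounding (for example df["lat"].round(1)) before set_index, so that both sides of the reindex hold identical labels. Passing method="nearest" with a small tolerance to reindex is another option to consider.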
EDIT2:
I think I found something that might help! My goal could be the same as what's happening here: How to interpolate latitude/longitude and heading in Pandas
But instead of interpolating to the nearest value, I could just match on the exact coordinates. Maybe the real problem here is mixing 100 grids in the end.
Any suggestions?
I have a GeoDataFrame with a point geometry.
From the point geometry, I want to define a square polygon geometry in a quite straightforward manner.
Given a point, the point should be the left bottom corner in a square with sides 250 units of length.
I.e, left bottom corner is the current point, right bottom corner is the current point + 250 on the x axis etc.
My naive way of doing this is the following:
Create the corners as new columns in the GeoDataFrame:
After that, I try to define a new column as:
gdf['POLY'] = shapely.Geometry([gdf['BOTTOM_LEFT'], gdf['BOTTOM_RIGHT'], gdf['TOP_LEFT'], gdf['TOP_RIGHT']])
But this returns the following error message:
AttributeError: 'list' object has no attribute '__array_interface__'
Your implementation is close, but you can't call shapely.geometry.Polygon with an array of points; it can only build one polygon at a time. So the trick is to use df.apply to call Polygon on every row of the DataFrame, passing axis=1 to apply and listing the corners in ring order:
gdf['geometry'] = gdf.apply(
    lambda s: shapely.geometry.Polygon(
        [s['BOTTOM_LEFT'], s['BOTTOM_RIGHT'], s['TOP_RIGHT'], s['TOP_LEFT']]
    ),
    axis=1,
)
You could also do that from your original point using shapely.affinity.translate:
from shapely.affinity import translate
gdf['geometry'] = gdf.apply(
    lambda s: shapely.geometry.Polygon(
        [
            s['POINT'],
            translate(s['POINT'], xoff=250),
            translate(s['POINT'], xoff=250, yoff=250),
            translate(s['POINT'], yoff=250),
        ]
    ),
    axis=1,
)
Let's assume you have a GeoDataFrame with only a single point. It is called gdf and it looks as follows:
X Y geometry
0 5 6 POINT (5.00000 6.00000)
You can access the x and y components of the point using the following lambda function:
#Access x and y components of point geometry
X = gdf.geometry.apply(lambda x: x.x)
Y = gdf.geometry.apply(lambda x: x.y)
Now you can create a square object using shapely.geometry.Polygon. You need to specify the four vertices of the square. You can do it using:
gdf_square = shapely.geometry.Polygon([[X[0], Y[0]],
[X[0]+250, Y[0]],
[X[0]+250, Y[0]+250],
[X[0], Y[0]+250]])
You can get a square polygon object as shown below:
Note that if you have many points in the GeoDataFrame, you need to adapt the last step so that it creates the square polygon for the point in each row, one by one.
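For many points, one possible sketch uses shapely.geometry.box, which builds an axis-aligned rectangle from its minimum and maximum coordinates. The helper name square_from_corner and the sample points are hypothetical:

```python
from shapely.geometry import Point, box


def square_from_corner(pt, side=250):
    """Square polygon whose bottom-left corner is pt."""
    return box(pt.x, pt.y, pt.x + side, pt.y + side)


# Stand-in for the point geometry column of a GeoDataFrame.
points = [Point(5, 6), Point(100, 200)]
squares = [square_from_corner(p) for p in points]
```

With a GeoDataFrame, the same comprehension can run over gdf.geometry and the resulting list can be assigned to a new column.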
In my case it was more than 5 times faster to build the squares with a list comprehension than with GeoDataFrame.apply:
polys = [Polygon(((x, y), (x, y+d), (x+d, y+d), (x+d, y))) for x in xs for y in ys]
gdf = gpd.GeoDataFrame(geometry=polys)
In the geopandas documentation it says that
A GeoDataFrame may also contain other columns with geometrical (shapely) objects, but only one column can be the active geometry at a time. To change which column is the active geometry column, use the set_geometry method.
I'm wondering how to use such a GeoDataFrame if the goal is to flexibly reproject the geometrical data in these various columns to one or more other coordinate reference systems. Here's what I tried.
First try
import geopandas as gpd
from shapely.geometry import Point
crs_lonlat = 'epsg:4326' #geometries entered in this crs (lon, lat in degrees)
crs_new = 'epsg:3395' #geometries needed in (among others) this crs
gdf = gpd.GeoDataFrame(crs=crs_lonlat)
gdf['geom1'] = [Point(9,53), Point(9,54)]
gdf['geom2'] = [Point(8,63), Point(8,64)]
#Working: setting geometry and reprojecting for first time.
gdf = gdf.set_geometry('geom1')
gdf = gdf.to_crs(crs_new) #geom1 is reprojected to crs_new, geom2 still in crs_lonlat
gdf
Out:
geom1 geom2
0 POINT (1001875.417 6948849.385) POINT (8 63)
1 POINT (1001875.417 7135562.568) POINT (8 64)
gdf.crs
Out: 'epsg:3395'
So far, so good. Things go off the rails if I want to set geom2 as the geometry column, and reproject that one as well:
#Not working: setting geometry and reprojecting for second time.
gdf = gdf.set_geometry('geom2') #still in crs_lonlat...
gdf.crs #...but this still says crs_new...
Out: 'epsg:3395'
gdf = gdf.to_crs(crs_new) #...so this doesn't do anything! (geom2 unchanged)
gdf
Out:
geom1 geom2
0 POINT (1001875.417 6948849.385) POINT (8.00000 63.00000)
1 POINT (1001875.417 7135562.568) POINT (8.00000 64.00000)
Ok, so, apparently, the .crs attribute of the gdf is not reset to its original value when changing the column that serves as the geometry - it seems, the crs is not stored for the individual columns. If that is the case, the only way I see to use reprojection with this dataframe, is to backtrack: start --> select column as geometry --> reproject gdf to crs_new --> use/visualize/... --> reproject gdf back to crs_lonlat --> goto start. This is not usable if I want to visualise both columns in one figure.
Second try
My second attempt was to store the crs with each column separately, by changing the corresponding lines in the script above to:
gdf = gpd.GeoDataFrame()
gdf['geom1'] = gpd.GeoSeries([Point(9,53), Point(9,54)], crs=crs_lonlat)
gdf['geom2'] = gpd.GeoSeries([Point(8,63), Point(8,64)], crs=crs_lonlat)
However, it soon became clear that, though initialised as a GeoSeries, these columns are normal pandas Series, and don't have a .crs attribute the same way GeoSeries do:
gdf['geom1'].crs
AttributeError: 'Series' object has no attribute 'crs'
s = gpd.GeoSeries([Point(9,53), Point(9,54)], crs=crs_lonlat)
s.crs
Out: 'epsg:4326'
Is there something I'm missing here?
Is the only solution to decide on the 'final' crs beforehand, and do all the reprojecting before adding the columns? Like so...
gdf = gpd.GeoDataFrame(crs=crs_new)
gdf['geom1'] = gpd.GeoSeries([Point(9,53), Point(9,54)], crs=crs_lonlat).to_crs(crs_new)
gdf['geom2'] = gpd.GeoSeries([Point(8,63), Point(8,64)], crs=crs_lonlat).to_crs(crs_new)
#no more reprojecting done/necessary/possible! :/
...and then, when another crs is needed, rebuild the entire gdf from scratch? That can't be the way this was intended to be used.
Unfortunately, the desired behaviour is currently not possible. Due to limitations in the package, geopandas does not accommodate this use case at the moment, as can be seen in this issue in the github repo.
My workaround is to not use a GeoDataFrame at all, but rather to combine a normal pandas DataFrame, for the non-shapely data, with several separate geopandas GeoSeries, for the shapely geometry data. Each GeoSeries has its own crs and can be correctly reprojected whenever necessary.
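The workaround above might look roughly like this. It is a sketch only: the attrs DataFrame and its label column are hypothetical, while the points and CRS codes are the ones from the question.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

crs_lonlat = "EPSG:4326"
crs_new = "EPSG:3395"

# Non-geometry attributes stay in a plain DataFrame (hypothetical column).
attrs = pd.DataFrame({"label": ["a", "b"]})

# Each geometry column is its own GeoSeries and carries its own CRS.
geoms = {
    "geom1": gpd.GeoSeries([Point(9, 53), Point(9, 54)], crs=crs_lonlat),
    "geom2": gpd.GeoSeries([Point(8, 63), Point(8, 64)], crs=crs_lonlat),
}

# Reproject every series independently; the originals are left untouched.
projected = {name: series.to_crs(crs_new) for name, series in geoms.items()}
```

Because the series share the default 0..n-1 index with attrs, rows can still be matched up by index when both geometry sets need to appear in one figure.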
So I have a geopandas dataframe of ~10,000 rows like this. Each point is within the polygon (I've made sure of it).
point name field_id geometry
POINT(-0.1618445 51.5103873) polygon1 1 POLYGON ((-0.1642799 51.5113756, -0.1639581 51.5089851, -0.1593661 51.5096729, -0.1606536 51.5115358, -0.1642799 51.5113756))
I want to add a new column called distance_to_nearest_edge, which is the distance from the point to the nearest boundary of the polygon.
There is a shapely function that calculates what I want:
from shapely import wkt
poly = wkt.loads('POLYGON ((-0.1642799 51.5113756, -0.1639581 51.5089851, -0.1593661 51.5096729, -0.1606536 51.5115358, -0.1642799 51.5113756))')
pt = wkt.loads('POINT(-0.1618445 51.5103873)')
dist = poly.boundary.distance(pt)
---
dist = 0.0010736436340879488
But I'm struggling to apply this to 10k rows.
I've tried creating a function, but I keep getting errors ("'Polygon' object has no attribute 'encode'", 'occurred at index 0')
Eg:
def fxy(x, y):
    poly = wkt.loads(x)
    pt = wkt.loads(y)
    return poly.exterior.distance(pt)
Appreciate any help!
I think your data has missing values.
You can try this:
df['distance'] = df.apply(lambda row: row['point'].distance(row['geometry'].boundary) if pd.notnull(row['point']) and pd.notnull(row['geometry']) else np.nan, axis=1)
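The "'Polygon' object has no attribute 'encode'" error suggests, as an assumption, that the columns already hold shapely objects, in which case wkt.loads is not needed at all. A minimal sketch that works on the geometries directly and tolerates missing values (distance_to_nearest_edge as a function name and the sample data are hypothetical):

```python
import math

from shapely.geometry import Point, Polygon


def distance_to_nearest_edge(pt, poly):
    """Distance from pt to the outline of poly, or NaN if either is missing."""
    if pt is None or poly is None:
        return math.nan
    # poly.boundary covers the exterior ring and any interior rings (holes).
    return poly.boundary.distance(pt)


# Stand-ins for the 'point' and 'geometry' columns.
points = [Point(0.5, 0.25), None]
polygons = [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])] * 2
distances = [distance_to_nearest_edge(p, g) for p, g in zip(points, polygons)]
```

The same comprehension over zip(df['point'], df['geometry']) can be assigned to the new column. Note that with coordinates in degrees the result is in degrees as well.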
I need to filter the rows of a dataframe that fall within a multipolygon. My multipolygon is stored in gdf_polygon and my points are stored in gdf. Here is a short summary of how they look.
gdf_polygon
id geometry
0 MULTIPOLYGON (((39.81239 21.43429, 39.81445 21...
gdf
id geometry
0 POINT (50.05832 26.43992)
... ...
The problem is that when I check whether any points are inside, it returns False, but I know there are some points inside the polygon.
Basically, if I run this I have False as output.
gdf_polygon.geometry.contains(gdf.geometry).any()
However, if I run this I get True as output, because that point is inside the polygon.
gdf_polygon.geometry.contains(gdf.geometry[141828])
I know I could iterate through all the rows of gdf and run the contains for each one, but since my dataset is quite big (around 30.000.000 rows) that would be very inefficient. So I was looking for an explanation or possible fixes.
My dataframes are created like this:
crs = {'init': 'epsg:4326'}
df = pd.read_csv(FOLDER+file, compression='gzip', escapechar='\\')
geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]
gdf = gpd.GeoDataFrame(df,crs=crs, geometry=geometry)
inside = gdf.geometry.within(gdf_polygon.geometry)
When comparing two GeoSeries in contains, geopandas aligns them by index, so each geometry is only compared with the one that has the same index label; see https://gis.stackexchange.com/questions/345785/geopandas-intersect-function-gives-different-result-to-shapely/345822#345822 for an explanation.
To make your code work as intended, you need to compare your GeoSeries of points with the multipolygon geometry itself, and do it the other way round, using within.
polygon = gdf_polygon.geometry.iloc[0]
gdf.geometry.within(polygon)
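A small, self-contained sketch of that approach, using made-up geometries in place of the real data, shows how the resulting boolean mask can be used to filter the points:

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon, Point, Polygon

# Hypothetical stand-ins for gdf_polygon and gdf.
gdf_polygon = gpd.GeoDataFrame(
    geometry=[MultiPolygon([Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])])],
    crs="EPSG:4326",
)
gdf = gpd.GeoDataFrame(
    {"id": [0, 1, 2]},
    geometry=[Point(5, 5), Point(1, 1), Point(0.5, 1.5)],
    crs="EPSG:4326",
)

# Compare every point with the single shapely geometry, not with a GeoSeries.
polygon = gdf_polygon.geometry.iloc[0]
mask = gdf.geometry.within(polygon)

# Keep only the rows whose point lies inside the multipolygon.
inside = gdf[mask]
```

Because the right-hand side is a single geometry, no index alignment takes place and every point is tested against the same multipolygon.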