How to limit the number of features read in using GeoPandas? - python

I have the following Python code to read my shapefile features into a GeoDataFrame using the points x, y.
import math
import shapely.geometry
import geopandas as gpd
from shapely.ops import nearest_points
absolute_path_to_shapefile = 'c:/test/test1.shp'
gdf1 = gpd.read_file(absolute_path_to_shapefile)
gdf = gpd.GeoDataFrame(
gdf1, geometry=gpd.points_from_xy(gdf1['x'], gdf1['y']))
Is there a way to limit the features read in? Some shapefiles have millions of points but I just want to read in the first 100 as proof of concept.

GeoPandas read_file() has a rows option to limit the number of rows read (or to use a slice to read specific rows).
import math
import shapely.geometry
import geopandas as gpd
from shapely.ops import nearest_points
absolute_path_to_shapefile = 'c:/test/test1.shp'
gdf1 = gpd.read_file(absolute_path_to_shapefile, rows=100)
gdf = gpd.GeoDataFrame(gdf1, geometry=gpd.points_from_xy(gdf1['x'], gdf1['y']))
GeoPandas documentation
geopandas.read_file(filename, bbox=None, mask=None, rows=None, **kwargs)
Returns a GeoDataFrame from a file or URL.
Parameters
filename: str, path object or file-like object
Either the absolute or relative path to the file or URL to be opened, or any object with a read() method (such as an open file or StringIO)
bbox: tuple | GeoDataFrame or GeoSeries | shapely Geometry, default None
Filter features by given bounding box, GeoSeries, GeoDataFrame or a shapely geometry. CRS mis-matches are resolved if given a GeoSeries or GeoDataFrame. Tuple is (minx, miny, maxx, maxy) to match the bounds property of shapely geometry objects. Cannot be used with mask.
mask: dict | GeoDataFrame or GeoSeries | shapely Geometry, default None
Filter for features that intersect with the given dict-like geojson geometry, GeoSeries, GeoDataFrame or shapely geometry. CRS mis-matches are resolved if given a GeoSeries or GeoDataFrame. Cannot be used with bbox.
rows: int or slice, default None
Load in specific rows by passing an integer (first n rows) or a slice() object.
**kwargs :
Keyword args to be passed to the open or BytesCollection method in the fiona library when opening the file. For more information on possible keywords, type: import fiona; help(fiona.open)
Returns
geopandas.GeoDataFrame or pandas.DataFrame :
If ignore_geometry=True a pandas.DataFrame will be returned.

Related

Convert CSV with WKT Polygon Geometry to Shapefile - Python

I am trying to convert a CSV file with WKT Polygon geometry to shapefile in Python, but cannot determine how to correctly integrate the geometry into a shapefile. Below is a segment of the CSV file:
ID Day Number Hours WKT
1 10 2 [12,12,13] POLYGON ((153.101112401 -27.797998206, 153.097860177 -27.807122487, 153.097715464 -27.8163131, 153.100598081 -27.821068293,...)
I am attempting to use the geopandas and shapely libraries and have found documentation that supports conversion from CSV to shapefile for Point geometry using latitude/longitude, but I cannot figure out how to do so without lat/lon and from Polygon geometry. When I attempt to plot the data, I get an "AttributeError: No geometry data set yet (expected in column 'geometry')". I can still generate a plot graphic, but there is no data associated with it. Once I can plot the data, I should be able to generate the desired shapefile output that preserves the attributes of the original CSV. Below is the code I am using:
import pandas as pd
import geopandas as gpd
from shapely import wkt
test_file = pd.read_csv("C:\\Users\\mdl518\\Desktop\\sample_data.csv") ## read the CSV
test_file['geometry'] = test_file.WKT.apply(wkt.loads) ## load the WKT geometry
gdf = gpd.GeoDataFrame(test_file, geometry='geometry') ## load CRS into Geodataframe
test_file_gdf.plot(markersize = 1.5, figsize = (10,10)) ## plot the data
## Obtaining the ESRI WKT
ESRI_WKT = 'GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137,298.257223563]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]]'
## Saving to shapefile
test_file_gdf.to_file(filename = "C:\\Users\\mdl518\\Desktop\\test_sample.shp", driver = "ESRI Shapefile", crs_wkt = ESRI_WKT)
I feel like this should otherwise be fairly straightforward, but I cannot figure out the missing steps in the geodataframe geometry integration, any assistance is most appreciated!

Clipping shapefile with multi-polygon shapefile in geopandas

I have a shapefile (called the source file hereafter) that I need to clip by a multi-polygon shapefile so that I get one clipped shapefile per polygon. With geopandas I am able to clip the source file by selecting each polygon from the multi-polygon shapefile individually, but when I try to loop over the polygons to automate the clipping I get the following error:
Error:
TypeError: 'mask' should be GeoDataFrame, GeoSeries or(Multi)Polygon, got <class 'tuple'>
Code:
import geopandas as gpd
source = ('source-shapefile.shp')
mask = ('mask_shapefile.shp')
sourcefile = gpd.read_file(source)
maskfile = gpd.read_file(mask)
for row in maskfile.iterrows():
    gpd.clip(sourcefile, row)
Two points:
https://geopandas.org/en/stable/docs/reference/api/geopandas.clip.html mask can be a GeoDataFrame, hence there is no need for looping.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html iterrows() yields a tuple of the index value and the row as a Series. Hence your error comes from passing this tuple to clip().
I have constructed an example. It is far simpler to clip using a GeoDataFrame as the mask.
import geopandas as gpd
import pandas as pd
# let's build a mask for use in clip, multipolygons and polygons
# note: gpd.datasets was removed in GeoPandas 1.0, so this needs an older version
maskfile = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
maskfile = maskfile.loc[maskfile["continent"].eq("Europe") & maskfile["name"].ne("Russia")].pipe(
    lambda d: d.assign(gdp_grp=pd.cut(d["gdp_md_est"], bins=4, labels=list("abcd")))
).dissolve("gdp_grp").reset_index()
sourcefile = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
# now clip, no looping needed
gpd.clip(sourcefile, maskfile)
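If one clipped result per mask polygon is still wanted, the loop from the question can be kept by unpacking the tuple that iterrows() yields. This is a minimal sketch with made-up in-memory data; the column names name and zone and the geometries are hypothetical stand-ins for the two shapefiles.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical stand-ins for the source and mask shapefiles.
sourcefile = gpd.GeoDataFrame(
    {"name": ["p0", "p1", "p2"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5.0, 5.0)],
    crs="EPSG:4326",
)
maskfile = gpd.GeoDataFrame(
    {"zone": ["west", "east"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# iterrows() yields (index, row) tuples; passing the tuple itself to clip()
# is what triggers the TypeError quoted in the question.
first_item = next(maskfile.iterrows())

# Unpack the tuple and pass only the row's geometry as the mask.
clipped_by_zone = {}
for idx, row in maskfile.iterrows():
    clipped_by_zone[row["zone"]] = gpd.clip(sourcefile, row["geometry"])
```

Each value in clipped_by_zone is a GeoDataFrame, so it could be written out with to_file() using the zone name in the file name.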
Finally, after 5 hours of research, I am now able to clip the shapefile by a multi-polygon shapefile and save the clipped polygons separately with their respective names. The following code may be dirty, but it works.
Code:
import geopandas as gpd
import pandas as pd
import os, sys
source = ('source-shapefile.shp')
mask = ('mask_shapefile.shp')
outpath = ('/outpath')
sourcefile = gpd.read_file(source)
maskfile = gpd.read_file(mask)
clipshape = maskfile.explode()
clipshape.set_index('CATCH_NAME', inplace=True) # CATCH_NAME is attribute column name
for index, row in clipshape['geometry'].items():  # Series.iteritems() was removed in pandas 2.0
    clipped = gpd.clip(sourcefile, row)
    clipped.to_file(os.path.join(outpath, f'{index}.shp'))

How to save a GeoDataFrame with many geometry columns (polygon, point and linestring) to a GeoJSON file (or a CSV file)?

I am trying to save the following geodataframe with columns (geometry, area, centroid, and boundary) to a json file using df.to_file('result.geojson', driver="GeoJSON"):
However, I get the following error because I have centroid and boundary columns.
TypeError: Cannot interpret '<geopandas.array.GeometryDtype object at 0x7fb7fff86940>' as a data type
This works perfectly fine when there is only geometry column and area column.
Also, when I save it as a CSV file and then read it back with geopandas, I am only able to convert geometry into the geometry datatype. However, centroid and boundary show as object datatype. How do I convert them to the geometry datatype?
As per the comments, GeoJSON supports only one geometry per feature, and a GeoDataFrame has only one active geometry column.
Hence your data frame has polygons as its geometry, and the other columns are series of shapely objects.
I have simulated a data set. It shows that your data structure is not normalised: boundary, centroid and area are calculated columns, so in relational terms they are derived from geometry.
This can be saved as CSV; the shapely objects will be encoded as WKT.
It can then simply be loaded, with a second step to decode the WKT back into shapely objects.
I have demonstrated that this works by plotting the geometries loaded from CSV.
import geopandas as gpd
import pandas as pd
import shapely.wkt
from pathlib import Path
gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
# derive columns in question...
gdf["boundary"] = gdf["geometry"].apply(lambda p: p.boundary )
gdf["centroid"] = gdf["geometry"].apply(lambda p: p.centroid )
gdf["area"] = gdf["geometry"].apply(lambda p: p.area )
# save as CSV; the shapely objects are written out as WKT strings
gdf.to_csv(Path.cwd().joinpath("SO_geom.csv"), index=False)
# load encoded dataframe
df = pd.read_csv(Path.cwd().joinpath("SO_geom.csv"))
# decode geometry columns as strings back into shapely objects
for c in ["geometry","boundary","centroid"]:
    df[c] = df[c].apply(shapely.wkt.loads)
# finally reconstruct geodataframe
gdf = gpd.GeoDataFrame(df)
# show it has worked
gdf.plot()
gpd.GeoSeries(gdf["boundary"]).plot()
gpd.GeoSeries(gdf["centroid"]).plot()
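On the second part of the question (getting centroid and boundary back as the geometry datatype rather than object), one option is to convert each WKT column with GeoSeries.from_wkt, available in GeoPandas 0.9+. This is a minimal sketch with a made-up one-row table standing in for the CSV that was read back.

```python
import geopandas as gpd
import pandas as pd

# Stand-in for a CSV that was read back with pandas: every geometry-like
# column arrives as plain WKT text (object dtype).
df = pd.DataFrame(
    {
        "name": ["square"],
        "geometry": ["POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))"],
        "centroid": ["POINT (1 1)"],
        "boundary": ["LINESTRING (0 0, 2 0, 2 2, 0 2, 0 0)"],
    }
)

# Convert each WKT column to a GeoSeries so it gets the geometry dtype.
for col in ["geometry", "centroid", "boundary"]:
    df[col] = gpd.GeoSeries.from_wkt(df[col])

gdf = gpd.GeoDataFrame(df, geometry="geometry")

# Only one column is the active geometry; set_geometry() switches it.
centroids = gdf.set_geometry("centroid")

# For formats that take a single geometry (such as GeoJSON), the extra
# geometry columns can be serialised back to WKT text before writing.
export = gdf.copy()
for col in ["centroid", "boundary"]:
    export[col] = export[col].to_wkt()
```

After the last step, export has a single geometry column plus text columns, which is the shape of data that to_file() with the GeoJSON driver can be expected to accept.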

Shapely doesn't recognize the geometry type of geoJson

I'm a beginner with shapely and I'm trying to read a shapefile, save it as GeoJSON and then use shape() to see the geometry type.
According to the docs, shape():
shapely.geometry.shape(context) Returns a new, independent geometry
with coordinates copied from the context.
Saving the shapefile as GeoJSON seems to work, but for some reason when I try to use shape() on the GeoJSON I get an error:
ValueError: Unknown geometry type: featurecollection
This is my script:
import geopandas
import numpy as np
from shapely.geometry import shape, Polygon, MultiPolygon, MultiLineString
#read shapefile:
myshpfile = geopandas.read_file('shape/myshape.shp')
myshpfile.to_file('myshape.geojson', driver='GeoJSON')
#read as GeoJson and use shape()
INPUT_FILE = 'shape/myshape.geojson'
geo_json = geopandas.read_file(INPUT_FILE)
#try to use shape()
geom = shape(geo_json)
>>>ValueError: Unknown geometry type: featurecollection
I have also tried to select the geometry column by slicing, but that does not seem to work either.
#try to use shape()
geom = shape(geo_json.iloc[:,9])
>>>TypeError: '(slice(None, None, None), 9)' is an invalid key
Right now I can't get past this step, but my end goal is to get the geometry type when I print geom.geom_type (currently I get the error before that).
Edit: when I check the type of the saved GeoJSON I get "geopandas.geodataframe.GeoDataFrame".
Your geo_json object is a geopandas.GeoDataFrame, which has a column of shapely geometries. There's no need to call shape(). If you want to check geom_type, there's an easy way to do that directly.
import geopandas
import numpy as np
from shapely.geometry import shape, Polygon, MultiPolygon, MultiLineString
#read shapefile:
myshpfile = geopandas.read_file('shape/myshape.shp')
myshpfile.to_file('myshape.geojson', driver='GeoJSON')
#read the GeoJSON back in
INPUT_FILE = 'myshape.geojson'
geo_json = geopandas.read_file(INPUT_FILE)
geo_json.geom_type
That will give you geom_type for each geometry in the dataframe. Maybe check geopandas documentation to get more familiar with the concept.
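For reference, shape() is meant for a single GeoJSON-like geometry mapping, not a whole collection, which is consistent with the "featurecollection" error above. If you do work with the raw GeoJSON rather than a GeoDataFrame, a minimal sketch would apply shape() per feature. The small FeatureCollection below is made up for illustration.

```python
import json

from shapely.geometry import shape

# A minimal GeoJSON FeatureCollection, as it would appear in a .geojson file.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"id": 1},
     "geometry": {"type": "Point", "coordinates": [10.0, 20.0]}},
    {"type": "Feature", "properties": {"id": 2},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}}
  ]
}
"""
collection = json.loads(geojson_text)

# shape() expects a single geometry mapping, so apply it per feature.
geoms = [shape(feature["geometry"]) for feature in collection["features"]]
geom_types = [g.geom_type for g in geoms]
print(geom_types)
```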

Converting a column of Polygons from string to GeoPandas geometry

I have a dataframe stored as a CSV file, one column of which holds Polygon objects. However, this column is stored as strings instead of GeoPandas geometry objects. How can I convert this column to GeoPandas geometry objects so that I can perform geo analysis?
This is what my data looks like:
my_df['geometry'].head()
0 POLYGON ((-122.419942 37.809021, -122.419938 3...
1 POLYGON ((-122.419942 37.809021, -122.419938 3...
2 POLYGON ((-122.419942 37.809021, -122.419938 3...
3 POLYGON ((-122.419942 37.809021, -122.419938 3...
4 POLYGON ((-122.405659 37.806674, -122.405974 3...
Name: geometry, dtype: object
I want to convert this Pandas DataFrame to Geopandas GeoDataFrame, using the column 'geometry' as the Geopandas geometry column.
my_geo_df = gpd.GeoDataFrame(my_df, geometry=my_df['geometry'])
However, as the column is stored as strings, gpd.GeoDataFrame() does not recognize it and therefore cannot actually create a GeoDataFrame.
TypeError: Input geometry column must contain valid geometry objects.
The format of your polygons is WKT, so you have to convert them to shapely Polygon objects. Following the GeoPandas docs (https://geopandas.readthedocs.io/en/latest/gallery/create_geopandas_from_pandas.html), do the following.
Using GeoPandas 0.9+:
my_df['geometry'] = gpd.GeoSeries.from_wkt(my_df['geometry'])
my_geo_df = gpd.GeoDataFrame(my_df, geometry='geometry')
Using older versions:
from shapely import wkt
my_df['geometry'] = my_df['geometry'].apply(wkt.loads)
my_geo_df = gpd.GeoDataFrame(my_df, geometry='geometry')
