I am trying to extract countries from NetCDF3 data using the PDSI monthly mean self-calibrated data from https://psl.noaa.gov/data/gridded/data.pdsi.html. I am using the following code, which performs a spatial join on the coordinates and identifies countries based on a shapefile of the world.
PDSI data format
# Import shapefile from geopandas
path_to_data = geopandas.datasets.get_path("naturalearth_lowres")
world_shp = geopandas.read_file(path_to_data)
world_shp.head()
# Import netCDF file
ncs = "pdsi.mon.mean.selfcalibrated.nc"
# Read in netCDF as a pandas dataframe
# Xarray provides a simple method of opening netCDF files, and converting them to pandas dataframes
ds = xr.open_dataset(ncs)
pdsi = ds.to_dataframe()
# the index in the df is a Pandas.MultiIndex. To reset it, use df.reset_index()
pdsi = pdsi.reset_index()
# quick check for shpfile plotting
world_shp.plot(figsize=(12, 8));
# use geopandas points_from_xy() to transform Longitude and Latitude into a list of shapely.Point objects and set it as a geometry while creating the GeoDataFrame
pdsi_gdf = geopandas.GeoDataFrame(pdsi, geometry=geopandas.points_from_xy(pdsi.lon, pdsi.lat))
print(pdsi_gdf.head())
# check CRS coordinates
world_shp.crs #shapefile
pdsi_gdf.crs #geodataframe netcdf
# set coordinates equal to each other
# PointsGeodataframe.crs = PolygonsGeodataframe.crs
pdsi_gdf.crs = world_shp.crs
# check coordinates after setting coordinates equal to each other
pdsi_gdf.crs #geodataframe netcdf
#spatial join
join_inner_df = geopandas.sjoin(pdsi_gdf, world_shp, how="inner")
join_inner_df
The problem I am having is that the original NetCDF data is gridded: the value of the key variable (pdsi) represents the area of each shaded square (see image below). So far, only the coordinate point at the centre of each square is being matched, and I would like each shaded square to match every country it falls inside. For example, if a shaded square lies within the boundaries of both Germany and the Netherlands, then the key variable should be attributed to both countries. Any help on this issue would be greatly appreciated.
NetCDF gridded data example
I have sourced the data you referenced so this can be re-run on any machine
the core of the solution is a square buffer around each point: https://gis.stackexchange.com/questions/314949/creating-square-buffers-around-points-using-shapely
I have analysed the data to ensure the value used for the buffer is appropriate and is calculated from the data
# make sure that data supports using a buffer...
assert (
    gdf["lat"].diff().loc[lambda s: s.ne(0)].mode()
    == gdf["lon"].diff().loc[lambda s: s.ne(0)].mode()
).all()
# how big should the square buffer be around the point??
buffer = gdf["lat"].diff().loc[lambda s: s.ne(0)].mode().values[0] / 2
gdf["geometry"] = gdf["geometry"].buffer(buffer, cap_style=3)
the remainder of the solution is now a spatial join
# the solution... spatial join buffered polygons to countries
# comma separate associated countries
gdf = gdf.join(
    world_shp.sjoin(gdf.set_crs("EPSG:4326"))
    .groupby("index_right")["name"]
    .agg(",".join)
)
I have used Plotly to visualise the result. From the image you can see that multiple countries have been associated with a single bounding box.
complete code
import urllib
from pathlib import Path
import geopandas as gpd
import plotly.express as px
import requests
import shapely.geometry
import xarray as xr
# download NetCDF data...
# fmt: off
url = "https://psl.noaa.gov/repository/entry/get/pdsi.mon.mean.selfcalibrated.nc?entryid=synth%3Ae570c8f9-ec09-4e89-93b4-babd5651e7a9%3AL2RhaV9wZHNpL3Bkc2kubW9uLm1lYW4uc2VsZmNhbGlicmF0ZWQubmM%3D"
f = Path.cwd().joinpath(Path(urllib.parse.urlparse(url).path).name)
# fmt: on
if not f.exists():
    r = requests.get(url, stream=True, headers={"User-Agent": "XY"})
    with open(f, "wb") as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
ds = xr.open_dataset(f)
pdsi = ds.to_dataframe()
pdsi = pdsi.reset_index().dropna() # don't care about places in oceans...
# use subset for testing... last 5 times...
pdsim = pdsi.loc[pdsi["time"].isin(pdsi.groupby("time").size().index[-5:])]
# create geopandas dataframe
gdf = gpd.GeoDataFrame(
    pdsim, geometry=pdsim.loc[:, ["lon", "lat"]].apply(shapely.geometry.Point, axis=1)
)
# make sure that data supports using a buffer...
assert (
    gdf["lat"].diff().loc[lambda s: s.ne(0)].mode()
    == gdf["lon"].diff().loc[lambda s: s.ne(0)].mode()
).all()
# how big should the square buffer be around the point??
buffer = gdf["lat"].diff().loc[lambda s: s.ne(0)].mode().values[0] / 2
gdf["geometry"] = gdf["geometry"].buffer(buffer, cap_style=3)
# Import shapefile from geopandas
path_to_data = gpd.datasets.get_path("naturalearth_lowres")
world_shp = gpd.read_file(path_to_data)
# the solution... spatial join buffered polygons to countries
# comma separate associated countries
gdf = gdf.join(
    world_shp.sjoin(gdf.set_crs("EPSG:4326"))
    .groupby("index_right")["name"]
    .agg(",".join)
)
gdf["time_a"] = gdf["time"].dt.strftime("%Y-%b-%d")
# simplest way to test is visualise...
px.choropleth_mapbox(
    gdf,
    geojson=gdf.geometry,
    locations=gdf.index,
    color="pdsi",
    hover_data=["name"],
    animation_frame="time_a",
    opacity=.3,
).update_layout(
    mapbox={"style": "carto-positron", "zoom": 1},
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
)
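If a country-level series is wanted rather than one value per grid cell, a possible follow-up (a sketch, not part of the original answer) is to split the comma separated country names and average pdsi per country and time:
# explode the comma separated country names, then average pdsi per country and time
per_country = (
    gdf.assign(name=gdf["name"].str.split(","))
    .explode("name")
    .groupby(["name", "time"])["pdsi"]
    .mean()
    .reset_index()
)
print(per_country.head())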
Related
I have a CSV file in which the 2nd and 3rd rows hold the lat and long values. The CSV file contains temperature data from 2011 to 2099 in India, and I want to filter the data for only the Satulaj basin using a shapefile of the Satulaj basin. How do I do this in Python?
import shapefile
from shapely.geometry import shape, Point
import pandas as pd
path="D:\\THESIS\\Others\\DharamVeer_Sir\\1_Future_Climate Data\\"
df = pd.read_csv(path+"test3.csv")
path1 = "D:\\THESIS\\Others\\DharamVeer_Sir\\satulaj (1)\\Satluj\\"
# read your shapefile
r = shapefile.Reader(path1+"satulaj.shp")
# get the shapes
shapes = r.shapes()
# build a shapely polygon from your shape
polygon = shape(shapes[0])
def check(lon, lat):
    # build a shapely point from your geopoint
    point = Point(lon, lat)
    # the contains function does exactly what you want
    return polygon.contains(point)

for i in range(len(df.axes[1])):
    sfile = df.values[0][i]
    dst = df.values[1][i]
    print(check(sfile, dst))
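If the goal is to keep only the stations inside the basin rather than just print True/False, a possible follow-up (assuming, like the loop above, that the first two data rows of each column hold that station's lon and lat) is:
# build a boolean mask over the columns and keep only the in-basin stations
inside = [check(df.values[0][i], df.values[1][i]) for i in range(len(df.axes[1]))]
df_basin = df.loc[:, inside]
print(df_basin.head())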
I have a shapefile (called the source file hereafter) which I need to clip by a multi-polygon shapefile, so that I end up with a clipped shapefile for each polygon. I tried geopandas: I am able to clip the source file by selecting the polygons individually from the multi-polygon shapefile, but when I try to loop over the polygons to automate the clipping I get the following error:
Error:
TypeError: 'mask' should be GeoDataFrame, GeoSeries or(Multi)Polygon, got <class 'tuple'>
Code:
import geopandas as gpd
source = ('source-shapefile.shp')
mask = ('mask_shapefile.shp')
sourcefile = gpd.read_file(source)
maskfile = gpd.read_file(mask)
for row in maskfile.iterrows():
    gpd.clip(sourcefile, row)
Two points
https://geopandas.org/en/stable/docs/reference/api/geopandas.clip.html: the mask can be a GeoDataFrame, so there is no need for looping
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html: iterrows() yields a tuple of the index value and the row as a Series, hence your error is that you are passing this tuple to clip()
I have constructed an example. It is far simpler to clip using a GeoDataFrame as the mask.
import geopandas as gpd
import pandas as pd
# lets build a mask for use in clip, multipolygons and polygons
maskfile = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
maskfile = (
    maskfile.loc[maskfile["continent"].eq("Europe") & maskfile["name"].ne("Russia")]
    .pipe(lambda d: d.assign(gdp_grp=pd.cut(d["gdp_md_est"], bins=4, labels=list("abcd"))))
    .dissolve("gdp_grp")
    .reset_index()
)
sourcefile = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
# now clip, no looping needed
gpd.clip(sourcefile, maskfile)
Finally, after 5 hours of research, I am now able to clip the shapefile by a multi-polygon shapefile and save the clipped polygons separately with their respective names. The following code may be dirty, but it works.
code:
import geopandas as gpd
import pandas as pd
import os, sys
source = ('source-shapefile.shp')
mask = ('mask_shapefile.shp')
outpath = ('/outpath')
sourcefile = gpd.read_file(source)
maskfile = gpd.read_file(mask)
clipshape = maskfile.explode()
clipshape.set_index('CATCH_NAME', inplace=True) # CATCH_NAME is attribute column name
for index, row in clipshape['geometry'].iteritems():
    clipped = gpd.clip(sourcefile, row)
    clipped.to_file(os.path.join(outpath, f'{index}.shp'))
I am trying to save the following GeoDataFrame, which has the columns geometry, area, centroid, and boundary, to a GeoJSON file using df.to_file('result.geojson', driver="GeoJSON"):
However, I get the following error because I have centroid and boundary columns.
TypeError: Cannot interpret '<geopandas.array.GeometryDtype object at 0x7fb7fff86940>' as a data type
This works perfectly fine when there is only geometry column and area column.
Also, when I try to save it as a CSV file and then read it back into geopandas, I am only able to convert the geometry column to the geometry dtype. However, centroid and boundary show as object dtype. How do I convert them to the geometry dtype?
as per the comments, both geopandas and GeoJSON support only one geometry per feature
hence your data frame has polygons as the geometry, and the other columns are series of objects (shapely objects)
I have simulated a data set. This clearly demonstrates that your data structure is not normalised: boundary, centroid and area are all columns calculated from the geometry
this can be saved as CSV; the shapely objects will be encoded as WKT
it can then simply be loaded, with a second step to decode the WKT back into shapely objects
I have demonstrated this works by plotting the geometries loaded from CSV
import geopandas as gpd
import pandas as pd
import shapely.geometry
import shapely.wkt
from pathlib import Path
gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
# derive columns in question...
gdf["boundary"] = gdf["geometry"].apply(lambda p: p.boundary )
gdf["centroid"] = gdf["geometry"].apply(lambda p: p.centroid )
gdf["area"] = gdf["geometry"].apply(lambda p: p.area )
# save as CSV; shapely objects are serialised to their WKT string representation
gdf.to_csv(Path.cwd().joinpath("SO_geom.csv"), index=False)
# load encoded dataframe
df = pd.read_csv(Path.cwd().joinpath("SO_geom.csv"))
# decode geometry columns as strings back into shapely objects
for c in ["geometry", "boundary", "centroid"]:
    df[c] = df[c].apply(shapely.wkt.loads)
# finally reconstruct geodataframe
gdf = gpd.GeoDataFrame(df)
# show it has worked
gdf.plot()
gpd.GeoSeries(gdf["boundary"]).plot()
gpd.GeoSeries(gdf["centroid"]).plot()
I'm trying to create a choropleth in Python 3 using shapely, fiona & bokeh for display.
I have a file with about 7000 lines that have the location of a town and a counter.
Example:
54.7604;9.55827;208
54.4004;9.95918;207
53.8434;9.95271;203
53.5979;10.0013;201
53.728;10.2526;197
53.646;10.0403;196
54.3977;10.1054;193
52.4385;9.39217;193
53.815;10.3476;192
...
I want to show these on a 12.5 km grid, for which a shapefile is available at
https://opendata-esri-de.opendata.arcgis.com/datasets/3c1f46241cbb4b669e18b002e4893711_0
The code I have works, but it's very slow: it's a brute-force algorithm that checks each of the 7127 grid polygons against all of the roughly 7000 points.
import pandas as pd
import fiona
from shapely.geometry import Polygon, Point, MultiPoint, MultiPolygon
from shapely.prepared import prep
sf = r'c:\Temp\geo_de\Hexagone_125_km\Hexagone_125_km.shp'
shp = fiona.open(sf)
district_xy = [ [ xy for xy in feat["geometry"]["coordinates"][0]] for feat in shp]
district_poly = [ Polygon(xy) for xy in district_xy] # coords to Polygon
df_p = pd.read_csv('points_file.csv', sep=';', header=None)
df_p.columns = ('lat', 'lon', 'count')
map_points = [Point(x,y) for x,y in zip(df_p.lon, df_p.lat)] # Convert Points to Shapely Points
all_points = MultiPoint(map_points) # all points
def calc_points_per_poly(poly, points, values):  # returns the total for one polygon
    poly = prep(poly)
    return sum([v for p, v in zip(points, values) if poly.contains(p)])
# this is the slow part
# for each shape this sums um the points
sum_hex = [calc_points_per_poly(x, all_points, df_p['count']) for x in district_poly]
Since this is extremely slow, I'm wondering if there is a faster way to compute sum_hex, especially since the real-world list of points may be a lot larger, and a finer grid with more shapes would deliver a better result.
I would recommend using geopandas and its built-in rtree spatial index. It lets you run the precise point-in-polygon check only when there is a possibility that the point lies within the polygon.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon, Point
sf = 'Hexagone_125_km.shp'
shp = gpd.read_file(sf)
df_p = pd.read_csv('points_file.csv', sep=';', header=None)
df_p.columns = ('lat', 'lon', 'count')
gdf_p = gpd.GeoDataFrame(df_p, geometry=[Point(x,y) for x,y in zip(df_p.lon, df_p.lat)])
sum_hex = []
spatial_index = gdf_p.sindex
for index, row in shp.iterrows():
    polygon = row.geometry
    possible_matches_index = list(spatial_index.intersection(polygon.bounds))
    possible_matches = gdf_p.iloc[possible_matches_index]
    precise_matches = possible_matches[possible_matches.within(polygon)]
    sum_hex.append(sum(precise_matches['count']))
shp['sum'] = sum_hex
This solution should be faster than yours. You can then plot your GeoDataFrame via Bokeh. If you want more details on spatial indexing, I recommend this article by Geoff Boeing: https://geoffboeing.com/2016/10/r-tree-spatial-index-python/
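As a rough sketch of that Bokeh step (not part of the original answer; it assumes the "sum" column created above and a Bokeh version that provides GeoJSONDataSource), the hexagons can be drawn as patches coloured by their summed counts:
from bokeh.models import GeoJSONDataSource, LinearColorMapper
from bokeh.plotting import figure, show
# serialise the GeoDataFrame (including its "sum" column) to GeoJSON for Bokeh
geo_source = GeoJSONDataSource(geojson=shp.to_json())
mapper = LinearColorMapper(palette="Viridis256", low=shp["sum"].min(), high=shp["sum"].max())
p = figure(match_aspect=True)
p.patches("xs", "ys", source=geo_source,
          fill_color={"field": "sum", "transform": mapper},
          line_color="grey", fill_alpha=0.8)
show(p)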
I have a 2-d gridded file which represents the land-use catalogue for the area of interest.
I also have some lat/lon points distributed across this area.
import pandas as pd
import matplotlib.pyplot as plt
from netCDF4 import Dataset
## 2-d gridded file
nc_file = "./geo_em.d02.nc"
geo = Dataset(nc_file, 'r')
lu = geo.variables["LU_INDEX"][0, :, :]
lat = geo.variables["XLAT_M"][0, :]
lon = geo.variables["XLONG_M"][0, :]
## point file
point = pd.read_csv("./point_data.csv")
plt.pcolormesh(lon, lat, lu)
plt.scatter(point.lon, point.lat, color='r')
I want to extract the values of the 2-d gridded field for the cells those points fall in, but I find it difficult to define a simple function to do that.
Is there an efficient method to achieve this?
Any advice would be appreciated.
PS
I have uploaded my files here
1. nc_file
2. point_file
I can propose a solution like this, where I just loop over the points and select the data based on the distance from each point.
#!/usr/bin/env ipython
import numpy as np
from netCDF4 import Dataset
import matplotlib.pylab as plt
import pandas as pd
# --------------------------------------
## 2-d gridded files
nc_file = "./geo_em.d02.nc"
geo = Dataset(nc_file, 'r')
lu = geo.variables["LU_INDEX"][0,:,:]
lat = geo.variables["XLAT_M"][0,:]
lon = geo.variables["XLONG_M"][0,:]
## point files
point = pd.read_csv("./point_data.csv")
plt.pcolormesh(lon,lat,lu)
#plt.scatter(point_data.lon,cf_fire_data.lat, color ='r')
# --------------------------------------------
# get data for points:
dataout = []
lon_ratio = np.cos(np.mean(lat) * np.pi / 180.0)
for ii in range(len(point)):
    plon, plat = point.lon[ii], point.lat[ii]
    distmat = np.sqrt(1. / lon_ratio * (lon - plon)**2 + (lat - plat)**2)
    kk = np.where(distmat == np.min(distmat))
    dataout.append([float(lon[kk]), float(lat[kk]), float(lu[kk])])
# ---------------------------------------------
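A small follow-up sketch (the output column names are my own assumptions): collect the nearest-cell matches in a DataFrame so they sit alongside the original point data.
# gather the matches produced by the loop above; column names are assumed, not from the original
df_out = pd.DataFrame(dataout, columns=["grid_lon", "grid_lat", "LU_INDEX"])
result = pd.concat([point.reset_index(drop=True), df_out], axis=1)
print(result.head())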