Python: how to convert geotiff to geopandas?

I have a GeoTIFF file.
import xarray as xr
import matplotlib.pyplot as plt

urbanData = xr.open_rasterio('myGeotiff.tif')
plt.imshow(urbanData)
Here is the link to the file.
I can convert the file to a dataframe with the coordinates as points:
import geopandas as gpd

ur = xr.DataArray(urbanData, name='myData')
ur = ur.to_dataframe().reset_index()
gdfur = gpd.GeoDataFrame(ur, geometry=gpd.points_from_xy(ur.x, ur.y))
However, I would like to get a dataframe that contains the geometry of the pixels as polygons and not as points. Is that possible?

Somewhat to my surprise, I haven't really found a package which wraps rasterio.features to take DataArrays and produce GeoDataFrames.
These might be very useful though:
https://corteva.github.io/geocube/stable/
https://corteva.github.io/rioxarray/stable/
I generally use something like this:
import affine
import geopandas as gpd
import rasterio.features
import xarray as xr
import shapely.geometry as sg
def polygonize(da: xr.DataArray) -> gpd.GeoDataFrame:
    """
    Polygonize a 2D-DataArray into a GeoDataFrame of polygons.

    Parameters
    ----------
    da : xr.DataArray

    Returns
    -------
    polygonized : geopandas.GeoDataFrame
    """
    if da.dims != ("y", "x"):
        raise ValueError('Dimensions must be ("y", "x")')

    values = da.values
    transform = da.attrs.get("transform", None)
    if transform is None:
        raise ValueError("transform is required in da.attrs")
    transform = affine.Affine(*transform)
    shapes = rasterio.features.shapes(values, transform=transform)

    geometries = []
    colvalues = []
    for (geom, colval) in shapes:
        geometries.append(sg.Polygon(geom["coordinates"][0]))
        colvalues.append(colval)

    gdf = gpd.GeoDataFrame({"value": colvalues, "geometry": geometries})
    gdf.crs = da.attrs.get("crs")
    return gdf
Note that you should squeeze the band dimension from your xarray first to make it 2D after reading it with xr.open_rasterio:
urbanData = xr.open_rasterio('myGeotiff.tif').squeeze('band', drop=True)
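A minimal usage sketch, assuming the raster's transform and crs are present in its attrs (as xr.open_rasterio provides); the plotting call is just an illustrative check:
gdfur = polygonize(urbanData)
print(gdfur.head())
gdfur.plot(column='value')  # colour each pixel polygon by its raster value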

Related

Filter the CSV file using a Shapefile in Python

I have a CSV file in which the 2nd and 3rd rows hold the lat and long values. The CSV file contains temperature data from 2011 to 2099 for India, and I want to filter the data to only the Satulaj basin using a shapefile of the basin. How do I do this in Python?
import shapefile
from shapely.geometry import shape, Point
import pandas as pd

path = "D:\\THESIS\\Others\\DharamVeer_Sir\\1_Future_Climate Data\\"
df = pd.read_csv(path + "test3.csv")
path1 = "D:\\THESIS\\Others\\DharamVeer_Sir\\satulaj (1)\\Satluj\\"

# read your shapefile
r = shapefile.Reader(path1 + "satulaj.shp")
# get the shapes
shapes = r.shapes()
# build a shapely polygon from your shape
polygon = shape(shapes[0])

def check(lon, lat):
    # build a shapely point from your geopoint
    point = Point(lon, lat)
    # the contains function does exactly what you want
    return polygon.contains(point)

# the coordinate pairs sit in the first two rows, one column per point
for i in range(len(df.axes[1])):
    sfile = df.values[0][i]
    dst = df.values[1][i]
    print(check(sfile, dst))
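A more concise alternative, just a sketch assuming the points have been reshaped to one row per point with lon/lat columns (the file name stations.csv is hypothetical), is to let GeoPandas run the point-in-polygon test in bulk:
import geopandas as gpd
import pandas as pd
import shapefile
from shapely.geometry import shape

# hypothetical layout: one row per point with 'lon' and 'lat' columns
pts = pd.read_csv("stations.csv")
polygon = shape(shapefile.Reader("satulaj.shp").shapes()[0])

gpts = gpd.GeoDataFrame(pts, geometry=gpd.points_from_xy(pts.lon, pts.lat))
inside = gpts[gpts.within(polygon)]  # vectorised contains test
print(inside)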

Extracting countries from NetCDF data using geopandas

I am trying to extract countries from NetCDF3 data using the PDSI monthly mean self-calibrated data from: https://psl.noaa.gov/data/gridded/data.pdsi.html. I am using the following code, which performs a spatial merge of coordinates and identifies countries based on a shapefile of the world.
PDSI data format
import geopandas
import xarray as xr

# Import shapefile from geopandas
path_to_data = geopandas.datasets.get_path("naturalearth_lowres")
world_shp = geopandas.read_file(path_to_data)
world_shp.head()
# Import netCDF file
ncs = "pdsi.mon.mean.selfcalibrated.nc"
# Read in netCDF as a pandas dataframe
# Xarray provides a simple method of opening netCDF files, and converting them to pandas dataframes
ds = xr.open_dataset(ncs)
pdsi = ds.to_dataframe()
# the index in the df is a Pandas.MultiIndex. To reset it, use df.reset_index()
pdsi = pdsi.reset_index()
# quick check for shpfile plotting
world_shp.plot(figsize=(12, 8));
# use geopandas points_from_xy() to transform Longitude and Latitude into a list of shapely.Point objects and set it as a geometry while creating the GeoDataFrame
pdsi_gdf = geopandas.GeoDataFrame(pdsi, geometry=geopandas.points_from_xy(pdsi.lon, pdsi.lat))
print(pdsi_gdf.head())
# check CRS coordinates
world_shp.crs #shapefile
pdsi_gdf.crs #geodataframe netcdf
# set coordinates equal to each other
# PointsGeodataframe.crs = PolygonsGeodataframe.crs
pdsi_gdf.crs = world_shp.crs
# check coordinates after setting coordinates equal to each other
pdsi_gdf.crs #geodataframe netcdf
#spatial join
join_inner_df = geopandas.sjoin(pdsi_gdf, world_shp, how="inner")
join_inner_df
The problem I am having is that the original NetCDF data is gridded, so the value of the key variable (pdsi) represents the area within each shaded square (see image below). So far only the coordinate points in the middle of the squares are being matched, and I would like each shaded square to match every country it overlaps. For example, if a shaded square lies within the boundaries of both Germany and the Netherlands, then the key variable should be attributed to both countries. Any help on this issue would be greatly appreciated.
NetCDF gridded data example
I have sourced the data you referenced so this can be re-run on any machine.
The core of the solution is a square buffer around each point: https://gis.stackexchange.com/questions/314949/creating-square-buffers-around-points-using-shapely
I have analysed the data to make sure the value used for the buffer is appropriate, calculating it from the data itself:
# make sure that data supports using a buffer...
assert (
    gdf["lat"].diff().loc[lambda s: s.ne(0)].mode()
    == gdf["lon"].diff().loc[lambda s: s.ne(0)].mode()
).all()
# how big should the square buffer be around the point??
buffer = gdf["lat"].diff().loc[lambda s: s.ne(0)].mode().values[0] / 2
# cap_style=3 produces square buffers
gdf["geometry"] = gdf["geometry"].buffer(buffer, cap_style=3)
The remaining solution is now a spatial join:
# the solution... spatial join buffered polygons to countries
# comma separate associated countries
gdf = gdf.join(
    world_shp.sjoin(gdf.set_crs("EPSG:4326"))
    .groupby("index_right")["name"]
    .agg(",".join)
)
I have used Plotly to visualise the result. From the image you can see that multiple countries have been associated with a bounding box.
Complete code:
import urllib.parse
from pathlib import Path

import geopandas as gpd
import plotly.express as px
import requests
import shapely.geometry
import xarray as xr

# download NetCDF data...
# fmt: off
url = "https://psl.noaa.gov/repository/entry/get/pdsi.mon.mean.selfcalibrated.nc?entryid=synth%3Ae570c8f9-ec09-4e89-93b4-babd5651e7a9%3AL2RhaV9wZHNpL3Bkc2kubW9uLm1lYW4uc2VsZmNhbGlicmF0ZWQubmM%3D"
f = Path.cwd().joinpath(Path(urllib.parse.urlparse(url).path).name)
# fmt: on

if not f.exists():
    r = requests.get(url, stream=True, headers={"User-Agent": "XY"})
    with open(f, "wb") as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)

ds = xr.open_dataset(f)
pdsi = ds.to_dataframe()
pdsi = pdsi.reset_index().dropna()  # don't care about places in oceans...

# use subset for testing... last 5 times...
pdsim = pdsi.loc[pdsi["time"].isin(pdsi.groupby("time").size().index[-5:])]

# create geopandas dataframe
gdf = gpd.GeoDataFrame(
    pdsim, geometry=pdsim.loc[:, ["lon", "lat"]].apply(shapely.geometry.Point, axis=1)
)

# make sure that data supports using a buffer...
assert (
    gdf["lat"].diff().loc[lambda s: s.ne(0)].mode()
    == gdf["lon"].diff().loc[lambda s: s.ne(0)].mode()
).all()

# how big should the square buffer be around the point??
buffer = gdf["lat"].diff().loc[lambda s: s.ne(0)].mode().values[0] / 2
gdf["geometry"] = gdf["geometry"].buffer(buffer, cap_style=3)

# Import shapefile from geopandas
path_to_data = gpd.datasets.get_path("naturalearth_lowres")
world_shp = gpd.read_file(path_to_data)

# the solution... spatial join buffered polygons to countries
# comma separate associated countries
gdf = gdf.join(
    world_shp.sjoin(gdf.set_crs("EPSG:4326"))
    .groupby("index_right")["name"]
    .agg(",".join)
)
gdf["time_a"] = gdf["time"].dt.strftime("%Y-%b-%d")

# simplest way to test is visualise...
px.choropleth_mapbox(
    gdf,
    geojson=gdf.geometry,
    locations=gdf.index,
    color="pdsi",
    hover_data=["name"],
    animation_frame="time_a",
    opacity=0.3,
).update_layout(
    mapbox={"style": "carto-positron", "zoom": 1},
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
)

Convert a column of GeoJSON-like strings to geometry objects in GeoPandas

I have a column in a GeoPandas dataframe with strings like '{type=Point, coordinates=[37.55, 55.71]}' or '{type=MultiPoint, coordinates=[[37.6, 55.4]]}'. It can be a polygon or any other geometry as well, and the coordinates can be several points in the form of a nested list. How can I transform these into ordinary GeoPandas geometry objects?
Use shapely.geometry.shape to convert GeoJSON-like mappings to shapely geometry.
from shapely.geometry import shape

df['geometry'] = df.apply(lambda row: shape(row['jsoncolumn']), axis=1)
I implemented it as follows. Thanks to @martinfleis
# The strings are not valid JSON ('=' instead of ':' and unquoted words),
# so after replacing '=' with ':' they are eval'ed as Python expressions.
# Defining these names as strings makes the bare words inside the eval'ed
# text resolve to the right values.
coordinates = 'coordinates'
type = 'type'
Point = 'Point'
MultiPoint = 'MultiPoint'
Polygon = 'Polygon'
MultiPolygon = 'MultiPolygon'
center = 'center'

df['geometry'] = df.geoData.apply(lambda x: shape(eval(x.replace('=', ':'))))
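An eval-free sketch of the same idea (the regex approach is my own suggestion, not from the original answers): quote the bare words so the string becomes valid JSON, then parse it.
import json
import re
from shapely.geometry import shape

def parse_geodata(s):
    # '{type=Point, coordinates=[37.55, 55.71]}'
    # -> '{"type": "Point", "coordinates": [37.55, 55.71]}'
    s = s.replace('=', ':')
    s = re.sub(r'([A-Za-z]\w*)', r'"\1"', s)
    return shape(json.loads(s))

df['geometry'] = df.geoData.apply(parse_geodata)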
From this source on GitHub, I built the following function:
import geopandas as gpd
import geojson
import json
import numpy as np
from shapely.geometry import shape

def geojsonification(x):
    geom = x['geom']
    if type(geom) == dict:
        s = json.dumps(geom)
        s2 = geojson.loads(s)
        res = shape(s2)
        return res
    else:
        return np.nan
which you can use like this:
gdf.geometry = gdf.apply(geojsonification, axis=1)

Python shapely: Aggregating points to shape files for a Choropleth map

I'm trying to create a choropleth in Python 3 using shapely, fiona & bokeh for display.
I have a file with about 7000 lines that have the location of a town and a counter.
Example:
54.7604;9.55827;208
54.4004;9.95918;207
53.8434;9.95271;203
53.5979;10.0013;201
53.728;10.2526;197
53.646;10.0403;196
54.3977;10.1054;193
52.4385;9.39217;193
53.815;10.3476;192
...
I want to show these in a 12.5 km grid, for which a shapefile is available at
https://opendata-esri-de.opendata.arcgis.com/datasets/3c1f46241cbb4b669e18b002e4893711_0
The code I have works, but it's very slow, because it's a brute-force algorithm that checks each of the 7127 grid polygons against all of the 7000 points.
import pandas as pd
import fiona
from shapely.geometry import Polygon, Point, MultiPoint, MultiPolygon
from shapely.prepared import prep

sf = r'c:\Temp\geo_de\Hexagone_125_km\Hexagone_125_km.shp'
shp = fiona.open(sf)
district_xy = [[xy for xy in feat["geometry"]["coordinates"][0]] for feat in shp]
district_poly = [Polygon(xy) for xy in district_xy]  # coords to Polygon

df_p = pd.read_csv('points_file.csv', sep=';', header=None)
df_p.columns = ('lat', 'lon', 'count')
map_points = [Point(x, y) for x, y in zip(df_p.lon, df_p.lat)]  # convert to Shapely Points
all_points = MultiPoint(map_points)  # all points

def calc_points_per_poly(poly, points, values):  # returns total for poly
    poly = prep(poly)
    return sum([v for p, v in zip(points, values) if poly.contains(p)])

# this is the slow part:
# for each shape this sums up the points
sum_hex = [calc_points_per_poly(x, all_points, df_p['count']) for x in district_poly]
Since this is extremely slow, I'm wondering if there is a faster way to compute sum_hex, especially since the real-world list of points may be a lot larger and a smaller grid with more shapes would deliver a better result.
I would recommend using geopandas and its built-in rtree spatial index. It lets you run the precise point-in-polygon test only when there is a real possibility that the point lies within the polygon.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon, Point

sf = 'Hexagone_125_km.shp'
shp = gpd.read_file(sf)

df_p = pd.read_csv('points_file.csv', sep=';', header=None)
df_p.columns = ('lat', 'lon', 'count')
gdf_p = gpd.GeoDataFrame(df_p, geometry=[Point(x, y) for x, y in zip(df_p.lon, df_p.lat)])

sum_hex = []
spatial_index = gdf_p.sindex
for index, row in shp.iterrows():
    polygon = row.geometry
    # candidate points whose bounding boxes intersect the polygon's bounds
    possible_matches_index = list(spatial_index.intersection(polygon.bounds))
    possible_matches = gdf_p.iloc[possible_matches_index]
    # exact point-in-polygon test only on the candidates
    precise_matches = possible_matches[possible_matches.within(polygon)]
    sum_hex.append(sum(precise_matches['count']))

shp['sum'] = sum_hex
This solution should be faster than yours. You can then plot your GeoDataFrame via Bokeh, as sketched below. If you want more details on spatial indexing, I recommend this article by Geoff Boeing: https://geoffboeing.com/2016/10/r-tree-spatial-index-python/
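A minimal Bokeh sketch, assuming shp carries the aggregated 'sum' column from above (the styling choices here are illustrative, not part of the original answer):
from bokeh.models import GeoJSONDataSource, LinearColorMapper
from bokeh.palettes import Viridis256
from bokeh.plotting import figure, show

# serialise the GeoDataFrame (polygons + 'sum' column) to GeoJSON
geo_source = GeoJSONDataSource(geojson=shp.to_json())
color_mapper = LinearColorMapper(palette=Viridis256)

p = figure(title="Choropleth", match_aspect=True)
p.patches("xs", "ys", source=geo_source,
          fill_color={"field": "sum", "transform": color_mapper},
          line_color="grey", line_width=0.3)
show(p)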

Extracting the value of 2-d gridded field by scatter points in Python

I have a 2-d gridded file which represents the land-use catalogue for the place of interest.
I also have some lat/lon based points distributed in this area.
import pandas as pd
import matplotlib.pyplot as plt
from netCDF4 import Dataset

## 2-d gridded file
nc_file = "./geo_em.d02.nc"
geo = Dataset(nc_file, 'r')
lu = geo.variables["LU_INDEX"][0,:,:]
lat = geo.variables["XLAT_M"][0,:]
lon = geo.variables["XLONG_M"][0,:]

## point file
point = pd.read_csv("./point_data.csv")

plt.pcolormesh(lon, lat, lu)
plt.scatter(point.lon, point.lat, color='r')
I want to extract the values of the 2-d gridded field at the cells those points fall in, but I have found it difficult to write a simple function for that.
Is there any efficient method to achieve it?
Any advice would be appreciated.
PS
I have uploaded my files here
1. nc_file
2. point_file
I can propose a solution like this, where I just loop over the points and select the grid cell with the minimum distance from each point.
#!/usr/bin/env python
import numpy as np
from netCDF4 import Dataset
import matplotlib.pylab as plt
import pandas as pd
# --------------------------------------
## 2-d gridded file
nc_file = "./geo_em.d02.nc"
geo = Dataset(nc_file, 'r')
lu = geo.variables["LU_INDEX"][0,:,:]
lat = geo.variables["XLAT_M"][0,:]
lon = geo.variables["XLONG_M"][0,:]
## point file
point = pd.read_csv("./point_data.csv")
plt.pcolormesh(lon, lat, lu)
#plt.scatter(point.lon, point.lat, color='r')
# --------------------------------------------
# get data for points:
dataout = []
lon_ratio = np.cos(np.mean(lat) * np.pi / 180.0)
for ii in range(len(point)):
    plon, plat = point.lon[ii], point.lat[ii]
    # approximate local distance; longitude differences scaled by cos(latitude)
    distmat = np.sqrt((lon_ratio * (lon - plon))**2 + (lat - plat)**2)
    # index of the nearest grid cell
    kk = np.where(distmat == np.min(distmat))
    dataout.append([float(lon[kk]), float(lat[kk]), float(lu[kk])])
# ---------------------------------------------
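If there are many points, a vectorised nearest-neighbour lookup is much faster. A minimal sketch using scipy's cKDTree over the flattened grid coordinates (my own suggestion, not part of the original answer):
import numpy as np
from scipy.spatial import cKDTree

lon_ratio = np.cos(np.mean(lat) * np.pi / 180.0)
# scale longitudes by cos(latitude) so Euclidean distance approximates
# the local metric, then index the flattened grid
tree = cKDTree(np.column_stack([(lon * lon_ratio).ravel(), lat.ravel()]))
_, kk = tree.query(np.column_stack([point.lon * lon_ratio, point.lat]))
dataout = np.column_stack([lon.ravel()[kk], lat.ravel()[kk], lu.ravel()[kk]])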
