I have a raster time series stored in multiple GeoTIFF files (*.tif) that I'd like to convert to a single NetCDF file. The data is uint16.
I could probably use gdal_translate to convert each image to netcdf using:
gdal_translate -of netcdf -co FORMAT=NC4 20150520_0164.tif foo.nc
and then some scripting with NCO to extract dates from filenames and then concatenate, but I was wondering whether I might do this more effectively in Python using xarray and its new rasterio backend.
I can read a file easily:
import glob
import xarray as xr
f = glob.glob('*.tif')
da = xr.open_rasterio(f[0])
da
which returns
<xarray.DataArray (band: 1, y: 5490, x: 5490)>
[30140100 values with dtype=uint16]
Coordinates:
* band (band) int64 1
* y (y) float64 5e+05 5e+05 5e+05 5e+05 5e+05 4.999e+05 4.999e+05 ...
* x (x) float64 8e+05 8e+05 8e+05 8e+05 8.001e+05 8.001e+05 ...
Attributes:
crs: +init=epsg:32620
and I can write one of these to NetCDF:
da.to_netcdf('foo.nc')
but ideally I would be able to use something like xr.open_mfdataset, write the time values (extracted from the filenames) and then write the entire aggregation to NetCDF, letting dask handle the out-of-core memory issues. :-)
Can something like this be done with xarray and dask?
Xarray should be able to do the concat step for you. I have adapted your example a bit below. It will be up to you to parse the filenames into something useful.
import glob
import pandas as pd
import xarray as xr
def time_index_from_filenames(filenames):
    '''helper function to create a pandas DatetimeIndex
    Filename example: 20150520_0164.tif'''
    return pd.DatetimeIndex([pd.Timestamp(f[:8]) for f in filenames])
filenames = glob.glob('*.tif')
time = xr.Variable('time', time_index_from_filenames(filenames))
chunks = {'x': 5490, 'y': 5490, 'band': 1}
da = xr.concat([xr.open_rasterio(f, chunks=chunks) for f in filenames], dim=time)
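From there, writing the whole aggregation to NetCDF is one more call. A minimal sketch (the variable name 'data' and the output filename are placeholders of my choosing):

# give the variable a name so it round-trips cleanly, then write it out;
# with chunked inputs, dask streams the write rather than loading everything
ds = da.to_dataset(name='data')
ds.to_netcdf('all_times.nc')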
Related
I have approximately 75 2D raster maps (tifs) of elevation over the exact same area, each acquired at a different time. I would like to stack these using xarray. I can read in each raster (see below), but currently there are no time coords, as I need to extract the time from the title of each file (2017-02-15T06:13:38Z in the file below).
da = xr.open_rasterio('tifs/DTSLOS_20170122_20190828_D79H_2017-02-15T06:13:38Z.tif')
da
<xarray.DataArray (y: 12284, x: 17633)>
[216603772 values with dtype=float64]
Coordinates:
band int64 1
* y (y) float64 59.62 59.62 59.62 59.62 59.62 ... 49.8 49.8 49.8 49.8
* x (x) float64 -12.17 -12.17 -12.17 -12.17 ... 1.931 1.932 1.932 1.933
Attributes:
transform: (0.0008, 0.0, -12.172852, 0.0, -0.0008, 59.623425)
crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,2...
res: (0.0008, 0.0008)
is_tiled: 1
nodatavals: (-9999.0,)
I'm assuming the way I should approach this is to add the time to each data array and then stack/concatenate them but I am new to xarray and am struggling to figure out how to do this.
You would need to convert the respective string into a datetime using datetime.strptime and set it as the time dimension along which you want to combine the datasets. You also need to expand this dimension so that, when using xr.combine_by_coords, you can combine the DataArrays along it. One way to do this would be:
import xarray as xr
from datetime import datetime
import pandas as pd
# collecting datasets when looping over your files
list_da = []
for path in ...:
    # path = "tifs/DTSLOS_20170122_20190828_D79H_2017-02-15T06:13:38Z.tif"
    da = xr.open_rasterio(path)
    time = path.split("_")[-1].split("Z")[0]
    dt = datetime.strptime(time, "%Y-%m-%dT%H:%M:%S")
    dt = pd.to_datetime(dt)
    da = da.assign_coords(time=dt)
    da = da.expand_dims(dim="time")
    list_da.append(da)
# stack dataarrays in list
ds = xr.combine_by_coords(list_da)
That's the way I approached this for my data. Not sure whether it's the most elegant solution, but it worked for me.
I am trying to convert a TIFF to a NetCDF file, but it fails with an IndexError:
import numpy as np
from netCDF4 import Dataset
import rasterio
with rasterio.drivers():
    src = rasterio.open(r"ia.tiff", "r")
    dst_transform = src.transform
    dst_width = src.width
    dst_height = src.height
    print(dst_transform)
    xmin = dst_transform[0]
    xmax = dst_transform[0] + dst_transform[1]*dst_width
    print(xmax)
    ymin = dst_transform[3] + dst_transform[5]*dst_height
    print(ymin)
    ymax = dst_transform[3]
    dst_width = dst_width + 1
    dst_height = dst_height + 1
    outf = Dataset(r'ia.nc', 'w', format='NETCDF4_CLASSIC')
    lats = np.linspace(ymin, ymax, dst_width)
    lons = np.linspace(xmin, xmax, dst_height)
    lat = outf.createDimension('lon', len(lats))
    lon = outf.createDimension('lat', len(lons))
    longitude = outf.createVariable('longitude', np.float64, ('lon',))
    latitude = outf.createVariable('latitude', np.float64, ('lat',))
    SHIA = outf.createVariable('SHIA', np.int8, ('lon', 'lat'))
    outf.variables['longitude'][:] = lons
    outf.variables['latitude'][:] = lats
    im = src.read()
    SHIA[:, :] = im
    outf.description = "IA for"
    longitude.units = "degrees east"
    latitude.units = 'degrees north'
    print("created empty array")
    outf.close()
The error is: IndexError: size of the data array does not conform to slice. Can somebody take a look and help me figure out where I went wrong? Much appreciated!
I use xarray for this kind of thing. Create an xarray DataArray for each variable you have (SHIA, it seems, in your case), then build a Dataset from the related DataArrays. Don't forget to set the coordinate variables on the Dataset as coordinates.
see:
http://xarray.pydata.org/en/stable/io.html
You can also convert your NetCDF / TIFF into a DataFrame and back again, but I wouldn't recommend this unless you have to, because NetCDF is multidimensional data while a DataFrame flattens everything into one matrix.
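A minimal sketch of that approach, assuming the ia.tiff file and SHIA variable from your question and a recent rasterio (I build the coordinate vectors from src.bounds and src.res rather than the raw transform):

import numpy as np
import rasterio
import xarray as xr

# read the first band and build pixel-centre coordinate vectors
with rasterio.open('ia.tiff') as src:
    im = src.read(1)  # 2D (rows, cols); src.read() with no band index is 3D
    lons = src.bounds.left + src.res[0] * (np.arange(src.width) + 0.5)
    lats = src.bounds.top - src.res[1] * (np.arange(src.height) + 0.5)

# wrap the array in a DataArray with named dims/coords, then write it out
da = xr.DataArray(im, dims=('lat', 'lon'),
                  coords={'lat': lats, 'lon': lons}, name='SHIA')
da.to_dataset().to_netcdf('ia.nc')

Incidentally, src.read() with no band index returns a 3D (band, row, col) array, which is likely why SHIA[:, :] = im raises the IndexError in your code.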
The easiest way I could think of is to use the GDAL tool.
# Convert TIF to netCDF
gdal_translate -of netCDF -co "FORMAT=NC4" ia.tif ia.nc
# Convert SHP to netCDF
gdal_rasterize -of netCDF -burn 1 -tr 0.01 0.01 input.shp output.nc
I know how to convert a netCDF4.Dataset to an xarray DataArray manually. However, I would like to know whether there is any simple and elegant way, e.g. using an xarray backend, to convert the following netCDF4.Dataset object to an xarray DataArray object:
<type 'netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
Originating_or_generating_Center: US National Weather Service, National Centres for Environmental Prediction (NCEP)
Originating_or_generating_Subcenter: NCEP Ensemble Products
GRIB_table_version: 2,1
Type_of_generating_process: Ensemble forecast
Analysis_or_forecast_generating_process_identifier_defined_by_originating_centre: Global Ensemble Forecast System (GEFS)
Conventions: CF-1.6
history: Read using CDM IOSP GribCollection v3
featureType: GRID
History: Translated to CF-1.0 Conventions by Netcdf-Java CDM (CFGridWriter2)
Original Dataset = /data/ldm/pub/native/grid/NCEP/GEFS/Global_1p0deg_Ensemble/member/GEFS_Global_1p0deg_Ensemble_20170926_0600.grib2.ncx3#LatLon_181X360-p5S-180p0E; Translation Date = 2017-09-26T17:50:23.259Z
geospatial_lat_min: 0.0
geospatial_lat_max: 90.0
geospatial_lon_min: 0.0
geospatial_lon_max: 359.0
dimensions(sizes): time2(2), ens(21), isobaric1(12), lat(91), lon(360)
variables(dimensions): float32 u-component_of_wind_isobaric_ens(time2,ens,isobaric1,lat,lon), float64 time2(time2), int32 ens(ens), float32 isobaric1(isobaric1), float32 lat(lat), float32 lon(lon), float32 v-component_of_wind_isobaric_ens(time2,ens,isobaric1,lat,lon)
groups:
I got this using siphon.ncss.
The next release of xarray (0.10) has support for this very thing, or at least getting an xarray dataset from a netCDF4 one, for exactly the reason you're trying to use it:
import netCDF4 as nc4
import xarray as xr

nc = nc4.Dataset('filename.nc', mode='r')  # Or from siphon.ncss
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
Or with siphon.ncss, this would look like:
from datetime import datetime
from siphon.catalog import TDSCatalog
import xarray as xr
gfs_cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog'
                     '/grib/NCEP/GFS/Global_0p5deg/catalog.xml')
latest = gfs_cat.latest
ncss = latest.subset()
query = ncss.query().variables('Temperature_isobaric')
query.time(datetime.utcnow()).accept('netCDF4')
nc = ncss.get_data(query)
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
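If you specifically want a DataArray rather than a Dataset, you can then index the result by variable name, e.g. with the variable queried above:

# select one variable from the Dataset as an xarray.DataArray
da = dataset['Temperature_isobaric']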
Until it's released, you could install xarray from master. Otherwise, the only other solution is to do everything manually.
I am currently working on extracting data from a .nc file to create a .cur file for use in GNOME. I am doing this in Python.
I extracted the following variables.
water_u(time, y, x)
water_v(time, y, x)
x(x):
y(y):
time(time): time
SEP(time, y, x)
The cur file should contain the following:
[x][y][velocity x][velocity y]
This should happen for each time value present. In this case I have 10 times extracted, but I have thousands and thousands of [x][y] and velocity values.
My question is: how do I extract the velocities based on the time variable?
import numpy as np
from netCDF4 import Dataset
volcgrp = Dataset('file_1.nc', 'r')
var = volcgrp.variables['water_v'][:]
print(var)
newList = var.tolist()
file = open('text.txt', 'w')
file.write('%s\n' % newList)
print("Done")
volcgrp.close()
The key here is to read in water_u and water_v with all three of their dimensions; then you can access those variables along the time dimension.
import netCDF4
ncfile = netCDF4.Dataset('file_1.nc', 'r')
time = ncfile.variables['time'][:] #1D
water_u = ncfile.variables['water_u'][:,:,:] #3D (time x lat x lon)
water_v = ncfile.variables['water_v'][:,:,:]
To access data at each grid point for the first time in this file:
water_u_first = water_u[0,:,:]
To store this 3D data into a text file as you describe in the comments, you'll need to (1) loop over time, (2) access water_u and water_v at that time, (3) flatten those 2D arrays to 1D, (4) convert to strings if using the standard file.write technique (can be avoided using Pandas to_csv for example), and (5) write out the 1D arrays as rows in the text file, as sketched below.
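Here is a minimal sketch of those five steps, assuming the variable names from your file (water_u, water_v, x, y, time); the output name out.cur and the space-separated row format are placeholders, since GNOME's exact .cur layout isn't specified in the question:

import numpy as np
import netCDF4

ncfile = netCDF4.Dataset('file_1.nc', 'r')
x = ncfile.variables['x'][:]
y = ncfile.variables['y'][:]
time = ncfile.variables['time'][:]
water_u = ncfile.variables['water_u'][:, :, :]  # (time, y, x)
water_v = ncfile.variables['water_v'][:, :, :]

# 2D grids of x/y so every grid point lines up with its velocity pair
xx, yy = np.meshgrid(x, y)

with open('out.cur', 'w') as f:
    for t in range(len(time)):            # (1) loop over time
        u = water_u[t, :, :].ravel()      # (2)+(3) slice this time, flatten to 1D
        v = water_v[t, :, :].ravel()
        for row in zip(xx.ravel(), yy.ravel(), u, v):
            # (4)+(5) format as strings and write one [x][y][u][v] row
            f.write('%s %s %s %s\n' % row)
ncfile.close()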