I'm trying to regrid a NetCDF file from 0.125 degrees to 0.083-degree spatial scale. The netcdf contains 224 latitudes and 464 longitudes and it has daily data for one year.
I tried xarray for it but it produces this memory error:
MemoryError: Unable to allocate 103. GiB for an array with shape (13858233841,) and data type float64
How can I regrid the file with python?
A Python option, using CDO as a backend, is my package nctoolkit: https://nctoolkit.readthedocs.io/en/latest/, installable via pip (https://pypi.org/project/nctoolkit/).
It has a built-in method called to_latlon, which will regrid to a specified lat/lon grid.
In your case, you would need to do:
import nctoolkit as nc
data = nc.open_data(infile)
data.to_latlon(lon = [lon_min,lon_max],lat=[lat_min,lat_max], res =[0.083, 0.083])
Another option is to try cf-python, which can (in general) regrid larger-than-memory datasets in both spherical polar coordinates and Cartesian coordinates. It uses the ESMF regridding engine to do this, so linear, first and second-order conservative, nearest neighbour, etc. regridding methods are available.
Here is an example of the kind of regridding that you need:
import cf
import numpy
f = cf.example_field(2) # Use cf.read to read your own data
print('Source field:')
print(f)
# Define the output grid
lat = cf.DimensionCoordinate(
data=cf.Data(numpy.arange(-90, 90.01, 0.083), 'degreesN'))
lon = cf.DimensionCoordinate(
data=cf.Data(numpy.arange(0, 360, 0.083), 'degreesE'))
# Regrid the field
g = f.regrids({'latitude': lat, 'longitude': lon}, method='linear')
print('\nRegridded field:')
print(g)
which produces:
Source field:
Field: air_potential_temperature (ncvar%air_potential_temperature)
Data : air_potential_temperature(time(36), latitude(5), longitude(8)) K
Cell methods : area: mean
Dimension coords: time(36) = [1959-12-16 12:00:00, ..., 1962-11-16 00:00:00]
: latitude(5) = [-75.0, ..., 75.0] degrees_north
: longitude(8) = [22.5, ..., 337.5] degrees_east
: air_pressure(1) = [850.0] hPa
Regridded field:
Field: air_potential_temperature (ncvar%air_potential_temperature)
Data : air_potential_temperature(time(36), latitude(2169), longitude(4338)) K
Cell methods : area: mean
Dimension coords: time(36) = [1959-12-16 12:00:00, ..., 1962-11-16 00:00:00]
: latitude(2169) = [-90.0, ..., 89.94399999999655] degreesN
: longitude(4338) = [0.0, ..., 359.971] degreesE
: air_pressure(1) = [850.0] hPa
There are plenty of options to get the destination grid from other fields, as well as defining it explicitly. More details can be found in the documentation.
cf-python will infer which axes are X and Y, etc from the CF metadata attached to the dataset, but if that is missing then there are always ways to manually set it or work around it.
The easiest way to do this is to use operators like cdo and nco.
For example:
cdo remapbil,target_grid infile.nc ofile.nc
The target_grid can be a descriptor file or you can use a NetCDF file with your desired grid resolution. Take note of other regridding methods that might suit your need. The example above is using bilinear interpolation.
Xarray uses something called 'lazy loading' to try and avoid using too much memory. Somewhere in your code, you are using a command which loads the entirety of the data into memory, which it cannot do. Instead, you should specify the calculation, then save the result directly to file. When the data is opened as dask-backed chunks (for example via the chunks argument of the open function), xarray will perform the calculation a chunk at a time without loading everything into memory.
An example of regridding might look something like this:
import numpy as np
import xarray as xr

da_input = xr.open_dataarray(
    'input.nc', chunks='auto')  # the file the data will be loaded from (lazy, needs dask)
regrid_axis = np.arange(-90, 90, 0.125)  # new coordinates
da_output = da_input.interp(lat=regrid_axis)  # specify calculation
da_output.to_netcdf('output.nc')  # save direct to file
Doing da_input.load(), or da_output.compute(), for example, would cause all the data to be loaded into memory - which you want to avoid.
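As a quick sanity check on the numbers in the question, here is a small sketch that converts element counts into GiB. It assumes float64 values (8 bytes each) and 365 daily time steps; the helper name gib is made up for this example. It suggests the source data itself is well under 1 GiB, so the 103 GiB request comes from an intermediate array rather than from the dataset.

```python
def gib(n_values, bytes_per_value=8):
    """Size in GiB of an array with n_values elements."""
    return n_values * bytes_per_value / 2**30

# Element count taken from the MemoryError in the question.
requested = gib(13_858_233_841)

# 224 latitudes x 464 longitudes x 365 days, assumed to be stored as float64.
source = gib(224 * 464 * 365)

print(f"{requested:.1f} GiB requested, {source:.2f} GiB in the source data")
```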
Another way to access the cdo functionality from within Python is to make use of the PyPI cdo project:
pip install cdo
Then you can do
from cdo import Cdo
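and then create an instance with cdo = Cdo() and call an operator as a method, for example cdo.remapbil(target_grid, input='infile.nc', output='ofile.nc'), where target_grid is one of the usual options: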
a nc file to use the grid from
a regular grid specifier e.g. r360x180
a txt file with a grid descriptor (see below)
There are several methods built in for the regridding:
remapbic : bicubic interpolation
remapbil : bilinear interpolation
remapnn : nearest neighbour interpolation
remapcon : first order conservative remapping
remapcon2 : 2nd order conservative remapping
You can use a grid descriptor file to define the area you need to interpolate to...
in the file grid.txt
xfirst=X (here X is the longitude of the left hand point)
xsize=NX (here put the number of points in domain)
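For reference, a complete descriptor for a regular lon/lat grid normally carries gridtype, xsize, ysize, xfirst, xinc, yfirst and yinc. The sketch below builds such a text in Python; the function name and the domain bounds are placeholders chosen for illustration, not values from the question.

```python
def lonlat_grid_description(lon_min, lon_max, lat_min, lat_max, res):
    """Build the text of a CDO grid description for a regular lon/lat grid."""
    # Number of points needed to span each extent at the requested spacing.
    xsize = int(round((lon_max - lon_min) / res)) + 1
    ysize = int(round((lat_max - lat_min) / res)) + 1
    lines = [
        "gridtype = lonlat",
        f"xsize    = {xsize}",
        f"ysize    = {ysize}",
        f"xfirst   = {lon_min}",
        f"xinc     = {res}",
        f"yfirst   = {lat_min}",
        f"yinc     = {res}",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical domain; replace with the bounds of your own data.
text = lonlat_grid_description(-125.0, -67.0, 25.0, 53.0, 0.083)
print(text)
```

Saving the printed text as grid.txt lets you pass that file name as the target grid.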
For more details you can refer to my video guide on interpolation.
I am trying to use the function zonal_stats from rasterstats Python package to get the raster statistics from a .tif file of each shape in a .shp file. I manage to do it in QGIS without any problems, but I have to do the same with more than 200 files, which will take a lot of time, so I'm trying the Python way. Both files and replication code are in my Google Drive.
My script is:
import rasterio
import geopandas as gpd
import numpy as np
from rasterio.plot import show
from rasterstats import zonal_stats
from rasterio.transform import Affine
# Import .tif file
raster = rasterio.open(r'M:\PUBLIC\Felipe Dias\Pesquisa\Interpolação Espacial\Arroz_2019-03.tif')
# Read the raster values
array = raster.read(1)
# Get the affine
affine = raster.transform
# Import shape file
shapefile = gpd.read_file(r'M:\PUBLIC\Felipe Dias\Pesquisa\Interpolação Espacial\Setores_Censit_SP_WGS84.shp')
# Zonal stats
zs_shapefile = zonal_stats(shapefile, array, affine = affine,
stats=['min', 'max', 'mean', 'median', 'majority'])
I get the following error:
Input In [1] in <cell line: 22>
zs_shapefile = zonal_stats(shapefile, array, affine = affine,
File ~\Anaconda3\lib\site-packages\rasterstats\main.py:32 in zonal_stats
return list(gen_zonal_stats(*args, **kwargs))
File ~\Anaconda3\lib\site-packages\rasterstats\main.py:164 in gen_zonal_stats
rv_array = rasterize_geom(geom, like=fsrc, all_touched=all_touched)
File ~\Anaconda3\lib\site-packages\rasterstats\utils.py:41 in rasterize_geom
rv_array = features.rasterize(
File ~\Anaconda3\lib\site-packages\rasterio\env.py:387 in wrapper
return f(*args, **kwds)
File ~\Anaconda3\lib\site-packages\rasterio\features.py:353 in rasterize
raise ValueError("width and height must be > 0")
I have found this question about the same problem, but I can't make it work with the solution: I have tried to reverse the sign of the items in the Affine of my raster data, but I couldn't make it work:
''' Trying to use the same solution of question: https://stackoverflow.com/questions/62010050/from-zonal-stats-i-get-this-error-valueerror-width-and-height-must-be-0 '''
old_tif = rasterio.open(r'M:\PUBLIC\Felipe Dias\Pesquisa\Interpolação Espacial\Arroz_2019-03.tif')
print(old_tif.profile) # copy & paste the output and change signs
new_tif_profile = old_tif.profile
# Affine(0.004611149999999995, 0.0, -46.828504575,
# 0.0, 0.006521380000000008, -24.01169169)
new_tif_profile['transform'] = Affine(0.004611149999999995, 0.0, -46.828504575,
0.0, -0.006521380000000008, 24.01169169)
new_tif_array = old_tif.read(1)
new_tif_array = np.fliplr(np.flip(new_tif_array))
with rasterio.open(r'M:\PUBLIC\Felipe Dias\Pesquisa\Interpolação Espacial\tentativa.tif', "w", **new_tif_profile) as dest:
dest.write(new_tif_array, indexes=1)
dem = rasterio.open(r'M:\PUBLIC\Felipe Dias\Pesquisa\Interpolação Espacial\tentativa.tif')
# Read the raster values
array = dem.read(1)
# Get the affine
affine = dem.transform
# Import shape file
shapefile = gpd.read_file(r'M:\PUBLIC\Felipe Dias\Pesquisa\Interpolação Espacial\Setores_Censit_SP_WGS84.shp')
# Zonal stats
zs_shapefile = zonal_stats(shapefile, array, affine=affine,
stats=['min', 'max', 'mean', 'median', 'majority'])
Doing this way, I don't get the "width and height must be > 0" error! But every stat in zs_shapefile is "NoneType", so it doesn't help my problem.
Does anyone understand why this error happens, and which sign I have to reverse to make it work? Thanks in advance!
I would be careful with overriding the geotransform of your raster like this, unless you are really convinced the original metadata is incorrect. I'm not too familiar with Affine, but it looks like you're now setting the latitude as positive, which places the raster in the northern hemisphere. My guess would be that this lack of intersection between the vector and raster causes the NoneType results.
I'm also not familiar with rasterstats, but I'm guessing it boils down to GDAL & Numpy at the core of it. So something you could try as a test is to add the all_touched=True keyword to your zonal_stats call.
If that works, it might indicate that the rasterization fails because your polygons are so small compared to the pixels, that the default rasterization method results in a rasterized polygon of size 0 (in at least one of the dimensions). And that's what the error also hints at (my guess).
Keep in mind that all_touched=True changes the stats you get in result, so I would only do it for testing, or if you're comfortable with this difference.
If you really need a valid value for these (too) small polygons, there are a few workarounds you could try. Something I've done is to simply take the centroid for these polygons, and take the value of the pixel where this centroid falls on.
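The centroid lookup can be sketched without any GIS library, assuming an unrotated transform given in the same (a, b, c, d, e, f) order as affine.Affine. The function name and the raster in the example are hypothetical. With rasterio, the dataset's index(x, y) method should give the same row and column.

```python
import math

def pixel_index(x, y, transform):
    """Return (row, col) of the pixel containing point (x, y).

    transform is (a, b, c, d, e, f) in the same order as affine.Affine,
    assumed here to have no rotation (b == d == 0).
    """
    a, b, c, d, e, f = transform
    if b != 0 or d != 0:
        raise ValueError("rotated transforms are not handled in this sketch")
    col = math.floor((x - c) / a)
    row = math.floor((y - f) / e)
    return row, col

# Hypothetical north-up raster: 0.005 degree pixels, upper-left corner at (-47.0, -23.0).
transform = (0.005, 0.0, -47.0, 0.0, -0.005, -23.0)

# Centroid of a (hypothetical) tiny polygon.
row, col = pixel_index(-46.9912, -23.0138, transform)
print(row, col)  # 2 1
```

The value for the polygon is then array[row, col], provided the indices fall inside the raster.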
A potential way to identify these polygons would be to use all_touched with the "count" statistic: every polygon with a count of only 1 might be too small to get rasterized correctly. To really find this out you would probably have to do the rasterization yourself using GDAL, given that rasterstats doesn't seem to allow it.
Note that due to the shape of some of the polygons you use, the centroid might fall outside of the polygon. But given how coarse your raster data is, relative to the vector, I don't think it would impact the result all that much.
An alternative is, instead of modifying the vector, to significantly increase the resolution of your raster. You could use gdal_translate to output this to a VRT, with some form of resampling, and avoid having to write this data to disk. Once the resolution is high enough that all polygons rasterize to at least a 1x1 array, it should probably work. But your polygons are tiny compared to the pixels, so it'll be a lot. You could guess it, or analyze the envelopes of all polygons. For example take the smallest edge of the envelope as more or less the resolution that's necessary for a correct rasterization.
Edit: To clarify the above a bit further.
The default rasterization strategy of GDAL (all_touched=False) is to consider a pixel "within" the polygon if the centroid of the pixel intersects with the polygon.
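A toy illustration of that rule, using an axis-aligned box as a stand-in for a polygon and a made-up 3x3 raster. This is not how GDAL implements it, only a sketch of the pixel-centre test.

```python
def centres_inside(box, origin, pixel_size, shape):
    """Return (row, col) of every pixel whose centre lies inside the box."""
    xmin, ymin, xmax, ymax = box
    x0, y0 = origin  # upper-left corner of the raster
    rows, cols = shape
    hits = []
    for row in range(rows):
        for col in range(cols):
            cx = x0 + (col + 0.5) * pixel_size
            cy = y0 - (row + 0.5) * pixel_size
            if xmin <= cx <= xmax and ymin <= cy <= ymax:
                hits.append((row, col))
    return hits

# 3x3 raster with 1-unit pixels, upper-left corner at (0, 3).
small = (0.1, 2.1, 0.4, 2.4)  # sits in a corner of pixel (0, 0) and misses its centre
large = (0.2, 1.2, 1.8, 2.8)  # contains the centres of four pixels

print(centres_inside(small, (0, 3), 1.0, (3, 3)))  # []
print(centres_inside(large, (0, 3), 1.0, (3, 3)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The small box overlaps a pixel but selects none, which is the situation that leaves a zero-size rasterized array.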
Using QGIS you can for example convert the pixels to points, and then do a spatial join with your vector. If you remove polygons that can't be joined (there's a checkbox), you'll get a different vector that most likely should work with rasterstats, given your current raster.
You could perhaps use that in the normal way (all_touched=False), and get the stats for the small polygons using all_touched=True.
In the image below, the green polygons are the ones that intersect with the centroid of a pixel, the red ones don't (and those are probably the ones rasterstats "tries" to rasterize to a size 0 array).
I wish to subset my xarray Dataset via a list of variable names. However, when I do so, the resultant Dataset no longer has the coordinate reference information, as evidenced by adding the subset as a layer in QGIS.
How can I keep the coordinate reference information after subsetting the original Dataset?
import xarray as xr
DS = xr.open_dataset("my_data.nc")
bands = ['CMI_C01','CMI_C02','CMI_C03']
# Test does not have coordinate reference information :(
test = DS[bands]
It is apparent that the coordinate reference information is not stored in the .coords attribute, due to the following not working:
# Test still does not have coordinate reference info
test = test.assign_coords(dict(DS.coords))
# When put into QGIS, does not have the CRS
Where is the CRS stored for xarray Datasets?
For background, I am using GOES imagery from the public AWS s3 bucket.
This is what the original Dataset looks like:
Dimensions: (y: 1500, x: 2500,
number_of_time_bounds: 2,
number_of_image_bounds: 2, band: 1)
Coordinates: (3/37)
* t datetime64[ns] 2017-03-04T08:38:0...
* y (y) float32 0.1265 ... 0.04259
* x (x) float32 -0.07501 ... 0.06493.47
Attributes: (2/29)
naming_authority: gov.nesdis.noaa
Conventions: CF-1.7
Coordinates in xarray refer to the dimension labels, and have nothing to do with spatial coordinate reference system metadata.
You're looking for xarray attributes. These can be accessed with .attrs, and you can carry over attributes from one dataset to another with ds.attrs.update(orig_ds.attrs).
You can carry over attributes within variables in a similar way, e.g. ds[name].attrs.update(orig_ds[name].attrs) for a variable called name.
As an example, after computing a simple operation which does not change the set of data variables or coordinates, you could do the following:
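# simple operation, which removes all attributes but does not change
# the dataset's structure
ds = orig_ds * 2
# copy the dataset-level, coordinate and variable attributes back
ds.attrs.update(orig_ds.attrs)
for c in ds.coords.keys():
    ds.coords[c].attrs.update(orig_ds.coords[c].attrs)
for v in ds.data_vars.keys():
    ds[v].attrs.update(orig_ds[v].attrs)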
Note that xarray does not explicitly handle CRS information ever, and additionally does not preserve attributes in computations by default. You can change this behavior to keep attributes across computation steps by default with xr.set_options(keep_attrs=True).
See the FAQ section: What is your approach to metadata? for more information. Also see the docs on Data Structures for more detail on the various xarray objects.
I want to clip one raster based on the extent of another (smaller) raster. First I determine the coordinates of the corners of the smaller raster using
import rasterio as rio
import gdal
from shapely.geometry import Polygon
src = gdal.Open('smaller_file.tif')
ulx, xres, xskew, uly, yskew, yres = src.GetGeoTransform()
lrx = ulx + (src.RasterXSize * xres)
lry = uly + (src.RasterYSize * yres)
geometry = [[ulx,lry], [ulx,uly], [lrx,uly], [lrx,lry]]
This gives me the following output geometry = [[-174740.0, 592900.0], [-174740.0, 2112760.0], [900180.0, 2112760.0], [900180.0, 592900.0]]. (Note that the crs is EPSG: 32651).
Now I would like to clip the larger file using rio.mask.mask(). According to the documentation, the shape variable should be GeoJSON-like dict or an object that implements the Python geo interface protocol (such as a Shapely Polygon). Therefore I create a Shapely Polygon out of the variable geometry, using
roi = Polygon(geometry)
Now everything is ready to use the rio.mask() function.
output = rio.mask.mask(larger_file.tif, roi, crop = True)
But this gives me the following error
TypeError: 'Polygon' object is not iterable
What do I do wrong? Or if someone knows a more elegant way to do it, please let me know.
(Unfortunately I cannot upload the two files since they're too large)
I found your question when I needed to figure out this kind of clipping myself. I got the same error and fixed it the following way:
rasterio.mask expects a list of features, not a single geometry. So the algorithm wants to run masking over several features bundled in an iterable (e.g. list or tuple) so we need to pass it our polygon within a list (or tuple) object.
The code you posted works after the following change:
roi = [Polygon(geometry)]
All we have to do is to enclose the geometry in a list/tuple and then rasterio.mask works as expected.
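Since the shapes argument also accepts GeoJSON-like dicts, the same idea can be sketched without shapely. The corner values below are the ones from the question; the 20 m pixel size and the raster dimensions are assumptions chosen to reproduce those corners, and extent_polygon is a made-up helper name.

```python
def extent_polygon(geotransform, xsize, ysize):
    """GeoJSON-like polygon covering a north-up raster described by a GDAL geotransform."""
    ulx, xres, xskew, uly, yskew, yres = geotransform
    lrx = ulx + xsize * xres
    lry = uly + ysize * yres
    # Closed ring: the first and last vertex are the same point.
    ring = [(ulx, lry), (ulx, uly), (lrx, uly), (lrx, lry), (ulx, lry)]
    return {"type": "Polygon", "coordinates": [ring]}

# Assumed geotransform and size for the smaller raster (20 m pixels).
geotransform = (-174740.0, 20.0, 0.0, 2112760.0, 0.0, -20.0)
roi = extent_polygon(geotransform, 53746, 75993)

# The masking function iterates over shapes, so the single geometry goes in a list.
shapes = [roi]
print(shapes[0]["coordinates"][0])

# With rasterio installed, and the larger raster opened as a dataset:
# import rasterio, rasterio.mask
# with rasterio.open("larger_file.tif") as dataset:
#     out_image, out_transform = rasterio.mask.mask(dataset, shapes, crop=True)
```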
I have GeoTIFF files loaded into xarray with crs = EPSG:31467. I want to transform/reproject (I don't know if there is a difference) these files into EPSG:4326. To do that, I use the rasterio.warp.transform function, which needs 1D arrays for x and y. To generate these I use the numpy.meshgrid and flatten functions. Here is a small example with my data:
import numpy as np
#Longitude and Latitude in EPSG:31467
lon = [3280914, 3281914, 3282914]
lat = [6103001, 6102001, 6101001]
#create 2d meshgrid
xv, yv = np.meshgrid(lon, lat)
xv, yv
(array([[3280914, 3281914, 3282914],
[3280914, 3281914, 3282914],
[3280914, 3281914, 3282914]]),
array([[6103001, 6103001, 6103001],
[6102001, 6102001, 6102001],
[6101001, 6101001, 6101001]]))
Now I have a sequence of different longitudes [3280914, 3281914, 3282914] for the same latitude [6103001, 6103001, 6103001].
When I now use rasterio.warp.transform(src_crs, dst_crs, x, y), these sequences disappear and I don't understand why:
from rasterio.warp import transform
# Compute the lon/lat coordinates with rasterio.warp.transform
lon, lat = transform('EPSG:31467','EPSG:4326',
xv.flatten(), yv.flatten())
np.asarray(lon).reshape(3,3), np.asarray(lat).reshape(3,3)
> (array([[5.57397386, 5.58957607, 5.6051787 ],
> [5.57473921, 5.59033795, 5.60593711],
> [5.57550412, 5.5910994 , 5.60669509]]), array([[55.00756605, 55.00800488, 55.00844171],
> [54.9985994 , 54.99903809, 54.99947477],
> [54.98963274, 54.99007128, 54.99050782]]))
np.unique(xv).shape, np.unique(yv).shape
> ((3,), (3,))
np.unique(lon).shape, np.unique(lat).shape
> ((9,), (9,))
To put the reprojected coordinates back into xarray, I need them to have the same structure as before, i.e. the same few shared values along each axis. Which process I don't understand, is it the function of transform or the concept of projections?
I can't understand what exactly you are trying to do after np.asarray(lon).reshape(3,3)
Which process I don't understand, is it the function of transform or the concept of projections?
It seems to be a bit of both.
EPSG:31467 and EPSG:4326 are fundamentally different types of data. EPSG:31467 is a planar rectangular coordinate system in a zonal projection. EPSG:4326 is not a projection at all; it is pure geodetic coordinates in the WGS-84 terrestrial coordinate system with the WGS-84 ellipsoid. What is important here is that points sharing a coordinate in EPSG:31467 don't have to share one in EPSG:4326, because in 4326 your coordinate is an angle and in 31467 your coordinate is a distance from the equator or from the false meridian. The axes of these systems are not parallel; they are related through the meridian convergence parameter. So, if you change Northing or Easting in 31467, both latitude and longitude can change.
Here you can notice an angle between blue lines (one cell is 31467 analogue) and black lines (whole grid is 4326 analogue)
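The size of that angle can be estimated with the first-order meridian convergence approximation for a transverse Mercator grid, (lon - lon0) * sin(lat). The sketch below assumes the central meridian of EPSG:31467 (9 degrees east) and a rough 111.32 km per degree of latitude, and compares the estimate with two points taken from the question's output. Both figures should come out close to each other, a little under 3 degrees.

```python
import math

CENTRAL_MERIDIAN = 9.0  # degrees east, central meridian of EPSG:31467

def convergence_deg(lon, lat, lon0=CENTRAL_MERIDIAN):
    """First-order meridian convergence (degrees) for a transverse Mercator grid."""
    return (lon - lon0) * math.sin(math.radians(lat))

# Two points from the question's output: same easting, 1000 m apart in northing.
lon_a, lat_a = 5.57397386, 55.00756605
lon_b, lat_b = 5.57473921, 54.99859940

# Convert the small lon/lat differences to approximate metres on the ground.
m_per_deg = 111_320.0  # rough length of one degree of latitude (assumption)
d_north = (lat_a - lat_b) * m_per_deg
d_east = (lon_b - lon_a) * m_per_deg * math.cos(math.radians(lat_a))

# Angle between the grid's north axis and true north, seen in the data.
observed = math.degrees(math.atan2(d_east, d_north))
predicted = abs(convergence_deg(lon_a, lat_a))
print(round(observed, 2), round(predicted, 2))
```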
It's pretty easy to check that the transformation works correctly: just do it backwards.
lon, lat = transform('EPSG:31467','EPSG:4326',
xv.flatten(), yv.flatten())
x_check, y_check = transform('EPSG:4326', 'EPSG:31467', lon, lat)
# rounding removes the small floating-point errors from the round trip
x_check = [int(round(i, 0)) for i in x_check]
print(lon)
print(x_check)
print(xv.flatten())
>[5.574033001416839, 5.5896346633743175, 5.605236748547687, 5.574797816145165, 5.5903960110246524, 5.605994628800234, 5.5755622060626155, 5.591156935778857, 5.6067520880717225]
>[3280914, 3281914, 3282914, 3280914, 3281914, 3282914, 3280914, 3281914, 3282914]
>[3280914 3281914 3282914 3280914 3281914 3282914 3280914 3281914 3282914]
The output shows that transform() returns exactly what it is expected to return.
Reshaping lon and x_check back to 3x3 arrays also works as expected (compare the output with the one above):
>[[5.574033 5.58963466 5.60523675]
> [5.57479782 5.59039601 5.60599463]
> [5.57556221 5.59115694 5.60675209]]
>[[3280914 3281914 3282914]
> [3280914 3281914 3282914]
> [3280914 3281914 3282914]]
I have never worked with rasterio, so I can't provide you with a working solution.
Some notes:
I have no idea why you need a grid for the raster transformation.
The rasterio docs are clear and have a solution for you: https://rasterio.readthedocs.io/en/latest/topics/reproject.html#reprojecting-a-geotiff-dataset
You can transform a raster between CRSs directly. If not in rasterio, try osgeo.gdal (gdal.Warp(dst_file, src_file, srcSRS='EPSG:31467', dstSRS='EPSG:4326')).
Note the difference between reprojecting a raster and defining its projection. The first changes the image, the second changes the metadata. For a direct transform to work correctly, your GeoTIFF must have a valid projection definition in its metadata (one that matches the actual projection of your raster).
If you're not developing a standalone app and just need to reproject 2-3 rasters, use QGIS and do it without coding. It's also helpful to try to understand the geodetic concepts on 2-3 examples in QGIS before coding. Just use it as a playground.
If you're not developing a standalone app, you can solve your automation task with the QGIS Python API. You can test the workflow in the UI and then call some QGIS/GDAL tools from a Python script as a batch. What is more, rasterio and all other packages will be available for installation in QGIS' Python. Of course, it's a bad idea for deployment unless you are creating a QGIS plugin.
In EPSG:31467 a coordinate increment of 0.001 is 1 mm, so more precision than that is useless. In EPSG:4326 one degree is roughly 111 km (for longitude, about 111.3*cos(lat) km), so you can work out the useful precision. Anything beyond 4-5 digits after the decimal point may also be useless.
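To put numbers on the last note, here is a rough sketch of how much ground distance the last decimal digit of a lon/lat value represents. It assumes about 111.32 km per degree of latitude and scales longitude by cos(lat); the function name is invented for the example.

```python
import math

def metres_per_decimal(decimals, lat_deg=0.0):
    """Approximate ground size (north-south, east-west) of the last decimal digit."""
    step = 10.0 ** -decimals
    north_south = step * 111_320.0  # rough metres per degree of latitude
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

for decimals in range(3, 8):
    ns, ew = metres_per_decimal(decimals, lat_deg=55.0)
    print(decimals, f"{ns:.4f} m", f"{ew:.4f} m")
```

At five decimals the last digit is about one metre, so how many digits are worth keeping depends on the accuracy of the raster itself.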
I have data produced from Comsol which I would like to use as a look up table in a Python / Scipy program I am building. The output from comsol looks like B(ri,thick,L) and will contain approximately 20,000 entries. An example of the output is shown below for a reduced 3x3x3 version.
While I have found many good solutions for 3D interpolation using e.g. RegularGridInterpolator (first link below), I am still looking for a solution using the lookup table style. The second link below seems close; however, I am unsure how the method interpolates over all three dimensions.
I am having a hard time believing that a lookup table requires such an elaborate implementation, so any suggestions are most appreciated!
COMSOL data example
interpolate 3D volume with numpy and or scipy
Interpolating data from a look up table
I was able to figure this out and wanted to pass on my solution to the next person. I found that merely averaging the two closest points found via a cKDTree yielded errors as large as 10%.
Instead, I used the cKDTree to find the appropriate entry in the scattered lookup table / data file and assign it to the correct entry of a 3D numpy array (you can save this numpy array to file if you like). Then I use RegularGridInterpolator on this array. Errors were on the order of 0.5 percent, which was an order of magnitude better than the cKDTree averaging.
import numpy as np
from scipy.spatial import cKDTree
from scipy.interpolate import RegularGridInterpolator
l_data = np.linspace(.125,0.5,16)# np.linspace(0.01,0.1,10) #Range for "short L"
ri_data = np.linspace(0.005,0.075,29)
thick_data = np.linspace(0.0025,0.1225,25)
#xyz data with known bounds above
F = np.zeros((np.size(l_data),np.size(ri_data),np.size(thick_data)))
LUT = np.genfromtxt('a_data_file.csv', delimiter = ',')
F_val = LUT[:, 3]
tree_small_l = cKDTree(LUT[:, :3]) #xyz coords
for ri_iter in np.arange(np.size(ri_data)):
    for thick_iter in np.arange(np.size(thick_data)):
        for l_iter in np.arange(np.size(l_data)):
            dist, ind = tree_small_l.query((l_data[l_iter], ri_data[ri_iter], thick_data[thick_iter]))
            F[l_iter, ri_iter, thick_iter] = F_val[ind].T
interp_F_func = RegularGridInterpolator((l_data, ri_data, thick_data), F)
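For intuition about how a gridded table is interpolated over all three dimensions at once, here is a plain-Python sketch of trilinear interpolation, which is roughly what a linear regular-grid interpolator does. It is not the scipy implementation; the helper names and the small table are made up for illustration.

```python
from bisect import bisect_right

def _bracket(axis, value):
    """Return (i, t): index of the lower grid point and the fractional position."""
    if not axis[0] <= value <= axis[-1]:
        raise ValueError("value outside the table range")
    i = min(bisect_right(axis, value) - 1, len(axis) - 2)
    t = (value - axis[i]) / (axis[i + 1] - axis[i])
    return i, t

def trilinear(axes, table, point):
    """Interpolate table[i][j][k], defined on three sorted axes, at point."""
    (i, tx), (j, ty), (k, tz) = (_bracket(a, p) for a, p in zip(axes, point))
    result = 0.0
    # Weighted sum over the 8 corners of the cell that contains the point.
    for di, wx in ((0, 1 - tx), (1, tx)):
        for dj, wy in ((0, 1 - ty), (1, ty)):
            for dk, wz in ((0, 1 - tz), (1, tz)):
                result += wx * wy * wz * table[i + di][j + dj][k + dk]
    return result

# Hypothetical table of f(x, y, z) = x + 2y + 3z on a small regular grid.
xs, ys, zs = [0.0, 1.0, 2.0], [0.0, 0.5, 1.0, 1.5], [0.0, 2.0]
table = [[[x + 2 * y + 3 * z for z in zs] for y in ys] for x in xs]

value = trilinear((xs, ys, zs), table, (0.5, 1.25, 1.75))
print(value)
```

Because the test function is linear in each variable, the interpolated value matches the exact one (8.25) up to floating-point rounding.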